Chapter 0
Introduction 


Let $\{Y_t, t \in \mathbb{Z}\}$ be a stationary, finitely valued stochastic process that admits a representation of the form $Y_t = f(X_t)$, where $\{X_t, t \in \mathbb{Z}\}$ is a finite Markov chain and $f$ is a many-to-one function. We call such a process a Hidden Markov Chain (HMC).

Under well known conditions on $f$ a HMC inherits the Markov property of $X_t$ and becomes a finite Markov chain itself, but this case is non-generic. In general a HMC need not be a Markov chain of any finite order and will therefore exhibit long-range dependencies of some kind. This makes the class of HMC's a very rich one, and it comes as no surprise that HMC's are used extensively in applications.

HMC's appear under various guises in such diverse fields as engineering (stochastic automata, speech recognition), biosciences (in ethology to model the mating behavior of some species, in medicine to study neurotransmission), economics (stock market predictions), and many others.

On  the  theoretical  side  the  same  fact  (lack  of  the  Markov  property)  makes  the 
class  of  HMC’s  difficult  to  work  with.  The  general  methods  developed  for  the 
study  of  stationary  processes  apply  but  being  non-specific  they  will  not  give  the 
best  results.  Theoretical  work  on  the  specific  class  of  HMC’s  has  proceeded  along 
two  main  lines. 

The early contributions, inspired by the work of Blackwell and Koopman (1957) [5], concentrated on the probabilistic aspects. The basic question was the characterization of HMC's. More specifically, the problem analyzed was: among all finitely valued stationary processes $Y_t$, characterize those that admit a HMC representation. This problem was solved by Heller [12] in 1965. To some extent Heller's result is not quite satisfactory since his methods are non-constructive. Even if $Y_t$ is known to be representable as a HMC, no algorithm has been devised to produce a Markov chain $X_t$ and a function $f$ such that $Y_t = f(X_t)$, or at least $Y_t \sim f(X_t)$ (i.e. they have the same laws). In recent years the problem has attracted the attention of workers in the area of Stochastic Realization Theory, and while some of the issues have been clarified, a constructive algorithm is still missing.

The first contributions dealing with statistical aspects were made in the late sixties. Baum and Petrie [4] studied maximum likelihood estimation of the parameters of a HMC, proving consistency and asymptotic normality of the MLE. They also provided an algorithm for the numerical computation of the MLE (of course there is little hope for an explicit solution in a non-Markovian setting), basically inventing the EM algorithm that became popular only later thanks to the work of Dempster, Laird and Rubin [8]. After the mid seventies HMC's made only sporadic appearances in the statistical literature. In 1975 HMC's were proposed by Baker [2] as models for automatic speech recognition (ASR) and ever since they have been adopted as one of the models of choice in this field. Computational aspects became very important and much work was done on the implementation of Baum's algorithm. A good survey of this area of research is [16], which also includes an extensive bibliography.

Although much work has been dedicated to parameter estimation for HMC's, only very recently has the order estimation problem received some attention. The order of an HMC $Y_t$ is the minimum integer $q$ for which there exists a $q$-valued Markov chain $X_t$ such that $Y_t = f(X_t)$ for some $f$. Knowledge of the order of an observed HMC $Y_t$ allows the construction of the most economical representations $f(X_t)$, in the sense that the number of parameters (the transition probabilities of $X_t$) is minimized. The order cannot be estimated using classical maximum likelihood because increasing the parameter $q$ automatically increases the likelihood. This is the typical behavior of the likelihood function when the parameter is structural, i.e. the parameter (usually integer valued) indexes the complexity of the model. As another example of a structural parameter we mention the order of a Markov chain, i.e. the smallest integer $m$ such that:

$$P(X_t \mid X_1^{t-1}) = P(X_t \mid X_{t-m}^{t-1}) \qquad \forall t \ge m+1, \ \forall X_1^t.$$

Again  the  maximum  likelihood  technique  fails  when  applied  to  the  estimation  of 
the  parameter  m. 
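
The failure of plain maximum likelihood can be seen numerically. The sketch below is only an illustration under stated assumptions: the helper `max_loglik`, the binary alphabet and the simulated i.i.d. sequence are hypothetical choices, not taken from the thesis. It fits Markov chains of order $m = 0, \dots, 3$ to the same observations and records the maximized log-likelihood, which never decreases with $m$ even though the true order is 0.

```python
import math
import random
from collections import Counter

def max_loglik(seq, m, start):
    """Maximized log-likelihood of an order-m Markov chain fit to seq.

    Terms are summed over t = start, ..., len(seq)-1 (start >= m), so that
    fits of different orders are compared on exactly the same observations.
    """
    ctx, trans = Counter(), Counter()
    for t in range(start, len(seq)):
        c = tuple(seq[t - m:t])          # the m symbols preceding seq[t]
        ctx[c] += 1
        trans[(c, seq[t])] += 1
    # The MLE of each transition probability is the relative frequency n / ctx[c].
    return sum(n * math.log(n / ctx[c]) for (c, _), n in trans.items())

rng = random.Random(0)
seq = [rng.randint(0, 1) for _ in range(500)]   # i.i.d. bits: true order is 0
logliks = [max_loglik(seq, m, start=3) for m in range(4)]
print(logliks)
```

Maximizing `logliks` over $m$ would therefore always favor the largest order considered, which is why a correction of the likelihood is needed.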

In  this  thesis  we  study  the  problem  of  order  estimation  for  Markov  chains  and 
hidden  Markov  chains.  The  technique  we  adopt  is  based  on  the  compensation  of 
the  likelihood  function.  A  penalty  term,  decreasing  in  q  (or  m),  is  added  to  the 
maximum  likelihood  and  the  resulting  compensated  likelihood  is  maximized  with 
respect  to  q  (or  m).  Proper  choice  of  the  penalty  term  allows  the  strongly  consistent 
estimation  of  the  structural  parameter.  Accurate  information  on  the  almost  sure 
asymptotic  behavior  of  the  maximum  likelihood  is  of  critical  importance  for  the 
correct  choice  of  the  penalty  term  and  the  Law  of  the  Iterated  Logarithm  (LIL)  is 
therefore  the  best  tool  for  this  study. 
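
A minimal sketch of the compensation idea for Markov chains follows. The penalty used here, half the number of free parameters times $\log n$, is an assumption made only for illustration; the penalty studied in the thesis is derived from the LIL and is not specified in this introduction. Subtracting a positive term that grows with $m$ is the same as adding a penalty that decreases in $m$. The names `max_loglik` and `estimate_order` and the periodic test sequence are hypothetical.

```python
import math
from collections import Counter

def max_loglik(seq, m, start):
    """Maximized log-likelihood of an order-m Markov chain on seq[start:]."""
    ctx, trans = Counter(), Counter()
    for t in range(start, len(seq)):
        c = tuple(seq[t - m:t])
        ctx[c] += 1
        trans[(c, seq[t])] += 1
    return sum(n * math.log(n / ctx[c]) for (c, _), n in trans.items())

def estimate_order(seq, max_order, r):
    """Maximize the compensated likelihood over m = 0, ..., max_order.

    r is the alphabet size; an order-m chain has r**m * (r - 1) free
    transition probabilities.  The penalty form is an assumption.
    """
    n = len(seq) - max_order
    def compensated(m):
        n_params = (r ** m) * (r - 1)
        return max_loglik(seq, m, max_order) - 0.5 * n_params * math.log(n)
    return max(range(max_order + 1), key=compensated)

# A deterministic period-3 sequence: two past symbols determine the next one,
# one past symbol does not, so the smallest exact order is 2.
seq = [0, 0, 1] * 100
m_hat = estimate_order(seq, max_order=3, r=2)
print(m_hat)
```

On this sequence the uncompensated likelihood is the same for $m = 2$ and $m = 3$, and the penalty breaks the tie in favor of the smaller order.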

The  technique  that  we  have  just  (roughly)  described  and  the  same  probabilistic 
tools  have  been  used  for  the  estimation  of  the  structural  parameters  of  ARMA 
processes  (see  e.g.  [1],  [11]),  but  we  are  not  aware  of  any  previous  work  that 
employs  this  approach  for  Markov  chains  or  hidden  Markov  chains. 

We conclude the introduction with a brief summary of the thesis. In Chapter 1 we formally define HMC's and collect some basic results that will be used in the sequel. Most of these results are available in the literature, to which we refer the reader for a more detailed treatment. Chapter 2 is dedicated to the proof of the consistency of the maximum likelihood estimator (MLE) of the parameters of a HMC. The novelty with respect to Baum and Petrie [4], where consistency was first proved, is that we do not require the observed process to be a HMC. The main results of this chapter are new. In Chapter 3 we deal with the estimation of the order of a Markov chain. First we use the LIL to find delicate bounds on the asymptotic behavior of the maximum likelihood estimator and then use the bounds to construct strongly consistent estimators of the order. The main results of Chapter 3 are new; moreover the chapter is practically self-contained. The final Chapter 4 is dedicated to the estimation of the order of HMC's. The behavior of the maximum likelihood is difficult to evaluate because no explicit expressions for the estimators are available. The LIL works for one special case, but we must use other methods to evaluate the asymptotics. We obtain some results with an approximation technique that uses Markov chains of increasing order $m$ to approximate the given HMC. These results are too weak to solve the problem of order determination and we will have to resort to a result from Information Theory to get the necessary asymptotics of the maximum likelihood. Apart from this heavy dependence on Information Theory the rest of Chapter 4 is new.
Chapter 1

Hidden  Markov  Chains 


In Section 1 we formally define HMC's, adopting the elegant formalism that originated in Stochastic Realization Theory, show its equivalence to the definition given in the introduction, and present some useful formulas for the computation of probabilities of interest. In Section 2 we review some results from Realization Theory that demonstrate the elusive nature of the notion of minimality for HMC's. We will attempt (with modest success) to circumvent the difficulty by introducing the class of regular HMC's. In Section 3 we give two results on the equivalence of representations. They are not the best available but will be enough for our purposes. In the final section we define parametric families of HMC's that will be used later and give a sufficient condition for their identifiability. The results presented in this chapter are scattered in the literature; our main goal was to bring them together as coherently as possible.

1.1  HMC  fundamentals 

There  are  many  equivalent  ways  of  defining  HMC’s.  We  particularly  like  the 
definition  that  originated  in  Realization  Theory  [23]  and  we  will  borrow  it. 

Definition  1.1.1  (SFSS) 

A pair $\{X_t, Y_t, \ t \in \mathbb{N}\}$ of stochastic processes defined on a probability space $(\Omega, \mathcal{F}, P)$ and taking values in the finite set $\mathcal{X} \times \mathcal{Y}$ is said to be a stationary finite stochastic system (SFSS) if the following conditions are met:

i) $(X_t, Y_t)$ are jointly stationary

ii) $P(Y_{t+1} = y_{t+1}, X_{t+1} = x_{t+1} \mid Y_1^t = y_1^t, X_1^t = x_1^t) = P(Y_{t+1} = y_{t+1}, X_{t+1} = x_{t+1} \mid X_t = x_t)$

The processes $X_t$ and $Y_t$ are called respectively the state and the output of the SFSS.
The cardinality of $\mathcal{X}$ will be called the size of the SFSS.


Definition  1.1.2  (HMC) 

A stochastic process $Y_t$ with values in the finite set $\mathcal{Y}$ is a Hidden Markov Chain (HMC) if it is equivalent to the output of a SFSS.


Recall that two stochastic processes are said to be equivalent if their laws coincide. Definition 1.1.2 has therefore to be interpreted as follows: the process $Y_t$ is a HMC if its probability distribution function $P_Y(y_1^n) := P(Y_1^n = y_1^n)$ can be represented as $P_Y(y_1^n) = P(\tilde Y_1^n = y_1^n)$, where $\tilde Y_t$ takes values in $\mathcal{Y}$ and is the output of a SFSS. Observe that we do not require $\tilde Y_t$ to be defined on the same probability space $(\Omega, \mathcal{F}, P)$ as $Y_t$; they can be completely different objects, but they are indistinguishable from observation. From now on when we refer to $Y_t$ as a HMC we will actually refer to any process $\tilde Y_t$ in the same equivalence class. We will refer to any SFSS $(X_t, \tilde Y_t)$ with $\tilde Y_t$ equivalent to $Y_t$ as a representation of the HMC $Y_t$.


Example  (adapted  from  Ornstein  [20]) 

A box contains a roulette wheel. We look at all possibilities for two consecutive spins of the wheel and divide these into two classes. Each time we spin the wheel (the spins are independent), we look at the last two spins and print out 0 if they fall in the first class and 1 if they fall in the second class. The output of the box is a HMC. Observe that the probability of printing a 1 at time $n$ cannot be determined from the observation of the previous values (unless the two classes have trivial configurations), i.e., the output need not be Markov of any order $m$.
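
The claim can be checked by brute force on a toy version of the box. The sketch below assumes a hypothetical two-pocket wheel with equally likely pockets and takes the second class to be the single pair (0, 1); these choices, and the function names, are illustrative only. It computes exactly the probability of printing a 1 after observing a 1 followed by $k$ zeros.

```python
from fractions import Fraction
from itertools import product

def output(spins, class_one):
    """Printed symbols: 1 when the last two spins fall in class_one, else 0."""
    return tuple(1 if (a, b) in class_one else 0 for a, b in zip(spins, spins[1:]))

def prob_one_after(history, pockets=(0, 1), class_one=frozenset({(0, 1)})):
    """P(next output = 1 | observed outputs == history), by enumerating all
    equally likely spin sequences (independent uniform spins)."""
    n = len(history) + 1                 # outputs, including the next one
    match = hit = 0
    for spins in product(pockets, repeat=n + 1):
        y = output(spins, class_one)
        if y[:-1] == tuple(history):
            match += 1
            hit += y[-1]
    return Fraction(hit, match)

# History: a 1 followed by k zeros, for k = 1, ..., 5.
probs = [prob_one_after((1,) + (0,) * k) for k in range(1, 6)]
print(probs)
```

In this toy case the conditional probability equals $k / (2(k+1))$, so it keeps changing with the length of the run of zeros and no finite window of past outputs suffices.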


In the introduction we referred to HMC's as stationary processes of the form $Y_t = f(X_t)$ where $X_t$ is a stationary Markov Chain, but this is equivalent to Definition 1.1.1. Clearly if $Y_t = f(X_t)$ with $X_t$ stationary Markov, the pair $(X_t, Y_t)$ will be a SFSS and $Y_t$ a HMC according to 1.1.2. Let now $Y_t$ be a HMC according to 1.1.2 and $X_t$ be the state process of a SFSS associated with $Y_t$. If we sum 1.1.1ii over $y_{t+1}$ we get $P(X_{t+1} = x_{t+1} \mid X_1^t = x_1^t, Y_1^t = y_1^t) = P(X_{t+1} = x_{t+1} \mid X_t = x_t)$ and after taking conditional expectations with respect to $X_1^t$ we have $P(X_{t+1} = x_{t+1} \mid X_1^t = x_1^t) = P(X_{t+1} = x_{t+1} \mid X_t = x_t)$. Therefore $X_t$ is a Markov Chain. As a direct consequence of 1.1.1ii we also have that the process $S_t = (X_t, Y_t)$ is a Markov Chain. Taking $f : \mathcal{X} \times \mathcal{Y} \to \mathcal{Y}$ to be the projection map on the second component, i.e. $f(x, y) = y$, we get the representation $Y_t = f(S_t)$ as desired.

We insisted on the fact that in general HMC's do not have finite memory; nevertheless their laws are completely specified by a finite number of parameters. In fact to specify the laws of a SFSS it is sufficient to specify the finite set of matrices $\{M(y), y \in \mathcal{Y}\}$ whose elements are: $m_{ij}(y) := P(Y_{t+1} = y, X_{t+1} = j \mid X_t = i)$, $i, j = 1, 2, \dots, |\mathcal{X}|$. Observe that the matrix $A := \sum_{y} M(y)$ is the transition matrix of the Markov Chain $X_t$. If to the matrices $M(y)$ we add an initial distribution vector $\pi$ such that $\pi = \pi A$ (stationarity) then we have a complete specification of the laws of the SFSS. Very often in the literature the following "factorization hypothesis" is made:

$$P(Y_{t+1} = y, X_{t+1} = j \mid X_t = i) = P(Y_{t+1} = y \mid X_{t+1} = j) \, P(X_{t+1} = j \mid X_t = i)$$

Since the factorization hypothesis always holds for the process $S_t = (X_t, Y_t)$ we will assume it without loss of generality (later we will impose conditions on the values $b_{iy}$ and therefore the assumption will become restrictive).

Let $b_{iy} := P(Y_t = y \mid X_t = i)$, $B$ the $|\mathcal{X}| \times |\mathcal{Y}|$ matrix of the $b_{iy}$, and $B_y := \mathrm{diag}\{b_{1y}, \dots, b_{cy}\}$ (where $c := |\mathcal{X}|$). The factorization hypothesis now gives:

$$M(y) = A B_y.$$
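
As a concrete illustration, the sketch below builds the matrices $M(y) = A B_y$ for a hypothetical two-state, two-symbol model (the numerical values of `A`, `B` and `pi` are assumptions chosen only for the example) and checks that summing over $y$ recovers $A$ and that $\pi = \pi A$.

```python
from fractions import Fraction as F

A = [[F(1, 2), F(1, 2)], [F(1, 4), F(3, 4)]]   # a_ij = P(X_{t+1}=j | X_t=i)
B = [[F(3, 4), F(1, 4)], [F(1, 3), F(2, 3)]]   # b_iy = P(Y_t=y | X_t=i)
pi = [F(1, 3), F(2, 3)]                        # candidate stationary vector

def output_matrices(A, B):
    """Return [M(y)] with m_ij(y) = a_ij * b_jy, i.e. M(y) = A B_y."""
    c, r = len(A), len(B[0])
    return [[[A[i][j] * B[j][y] for j in range(c)] for i in range(c)]
            for y in range(r)]

M = output_matrices(A, B)

# Summing M(y) over y must give back the transition matrix A.
A_back = [[sum(My[i][j] for My in M) for j in range(2)] for i in range(2)]

# Stationarity check: pi A should equal pi.
pi_A = [sum(pi[i] * A[i][j] for i in range(2)) for j in range(2)]
print(M, A_back, pi_A)
```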
We conclude this section with a collection of formulas. The formulas for the probability of some cylinders in terms of the parameters are well known and will be used later. The other formulas are obtained by elementary algebraic manipulations from 1.1.1ii. We present them here because they shed some light on the nature of the dependencies between the parameters of a SFSS.

In the sequel, when notationally convenient and not misleading, we will identify random variables with their values, e.g. $P(y_t)$ will mean $P(Y_t = y_t)$.

Lemma  1.1.1 

$$P(y_{t+1}^n \mid X_t = i) = e_i^T M(y_{t+1}) \cdots M(y_n) e \qquad \forall n \ge t+1$$

$$P(y_1^t, X_t = i) = \pi M(y_1) \cdots M(y_t) e_i$$

$$P(y_1^n) = \pi M(y_1) \cdots M(y_n) e$$

where $e_i$ is the $i$-th standard basis vector of $\mathbb{R}^c$ and $e = \sum_i e_i$.

Proof: everything follows directly from 1.1.1ii.

□ 
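
The third formula of Lemma 1.1.1 translates directly into a left-to-right product. The sketch below uses a hypothetical two-state, two-symbol model under the factorization hypothesis (the values of `A`, `B`, `pi` are assumptions) and compares the matrix formula with a brute-force sum over all state paths.

```python
from fractions import Fraction as F
from itertools import product

A = [[F(1, 2), F(1, 2)], [F(1, 4), F(3, 4)]]
B = [[F(3, 4), F(1, 4)], [F(1, 3), F(2, 3)]]
pi = [F(1, 3), F(2, 3)]                        # satisfies pi = pi A
M = [[[A[i][j] * B[j][y] for j in range(2)] for i in range(2)] for y in range(2)]

def word_prob(word):
    """P(y_1^n) = pi M(y_1) ... M(y_n) e, computed left to right."""
    g = pi[:]
    for y in word:
        g = [sum(g[i] * M[y][i][j] for i in range(2)) for j in range(2)]
    return sum(g)                              # multiplication by e

def word_prob_paths(word):
    """The same probability as a sum over state paths x_1 ... x_n (n >= 1)."""
    total = F(0)
    for xs in product(range(2), repeat=len(word)):
        p = pi[xs[0]] * B[xs[0]][word[0]]
        for k in range(1, len(word)):
            p *= A[xs[k - 1]][xs[k]] * B[xs[k]][word[k]]
        total += p
    return total

words = list(product(range(2), repeat=3))
total = sum(word_prob(w) for w in words)
print(total, word_prob((0,)))
```

The matrix form costs a number of operations linear in the length of the word, whereas the path sum grows exponentially; the two agree because $\pi A = \pi$.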


For future reference let us introduce the following definitions. For any word $v = y_1 y_2 \cdots y_k$ define:

$M(v) := M(y_1) M(y_2) \cdots M(y_k)$, a square matrix of size $|\mathcal{X}|$.

$g(v) := \pi M(v)$, a row vector of size $|\mathcal{X}|$.

$h(v) := M(v) e$, a column vector of size $|\mathcal{X}|$.

With $\phi$ we denote the null word and define $g(\phi) := \pi$, $h(\phi) := e$.

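The formula for the output probability can be written in scalar form as follows:

$$P(y_1^n) = \sum_{x_1^n} P(y_1^n, x_1^n) = \sum_{x_1^n} P(y_1^n \mid x_1^n) P(x_1^n)$$

Given the sequence $x_1^n$, the sequence $y_1^n$ is independent; in fact:

$$\begin{aligned}
P(y_1^n \mid x_1^n) &= \frac{P(y_1^n, x_1^n)}{P(x_1^n)} \\
&= \frac{P(y_n, x_n \mid x_{n-1}) \, P(x_1^{n-1}, y_1^{n-1})}{P(x_n \mid x_{n-1}) \, P(x_1^{n-1})} \\
&= P(y_n \mid x_n) \, P(y_1^{n-1} \mid x_1^{n-1})
\end{aligned}$$

and proceeding by induction we get $P(y_1^n \mid x_1^n) = \prod_{k=1}^{n} P(y_k \mid x_k)$.

Therefore:

$$\begin{aligned}
P(y_1^n) &= \sum_{x_1^n} P(x_1) P(y_1 \mid x_1) \prod_{k=1}^{n-1} P(x_{k+1} \mid x_k) P(y_{k+1} \mid x_{k+1}) \\
&= \sum_{x_1^n} \pi_{x_1} b_{x_1 y_1} a_{x_1 x_2} b_{x_2 y_2} a_{x_2 x_3} \cdots a_{x_{n-1} x_n} b_{x_n y_n}
\end{aligned}$$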

For the final part of the section we make the extra assumption that all probabilities are strictly positive (it is actually enough to assume $M(y) > 0$).

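Lemma 1.1.2 The following conditions are equivalent ($1 < t < n$):

i) $P(X_{t+1}, Y_{t+1} \mid X_1^t, Y_1^t) = P(X_{t+1}, Y_{t+1} \mid X_t)$

ii) $P(X_{t-1}, Y_{t-1} \mid X_t^n, Y_t^n) = P(X_{t-1}, Y_{t-1} \mid X_t)$.

Proof: Since the process $(X_t, Y_t)$ is Markov it is sufficient to check that

$$P(X_{t-1}, Y_{t-1} \mid X_t, Y_t) = P(X_{t-1}, Y_{t-1} \mid X_t).$$

But

$$\begin{aligned}
P(X_{t-1}, Y_{t-1} \mid X_t, Y_t) &= \frac{P(X_t, Y_t \mid X_{t-1}, Y_{t-1}) \, P(X_{t-1}, Y_{t-1})}{P(X_t, Y_t)} \\
&= \frac{P(X_t \mid X_{t-1}) \, P(Y_{t-1} \mid X_{t-1}) \, P(X_{t-1})}{P(X_t)} \\
&= P(Y_{t-1} \mid X_{t-1}) \, P(X_{t-1} \mid X_t).
\end{aligned}$$

This last expression is independent from $Y_t$ and therefore equals

$$P(X_{t-1}, Y_{t-1} \mid X_t).$$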


□ 


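Lemma 1.1.3 For ($1 < t < n$):

i) $P(X_t \mid X_1^{t-1}, Y_1^n) = P(X_t \mid X_{t-1}, Y_t^n)$

ii) $P(X_t \mid X_{t+1}^n, Y_1^n) = P(X_t \mid X_{t+1}, Y_1^t)$

Proof: First we will prove i.
The LHS is:

$$\frac{P(X_1^t, Y_1^n)}{P(X_1^{t-1}, Y_1^n)} = \frac{P(Y_t^n, X_t \mid X_1^{t-1}, Y_1^{t-1}) \, P(X_1^{t-1}, Y_1^{t-1})}{P(Y_t^n \mid X_1^{t-1}, Y_1^{t-1}) \, P(X_1^{t-1}, Y_1^{t-1})} = \frac{P(Y_t^n, X_t \mid X_{t-1})}{P(Y_t^n \mid X_{t-1})}$$

since the last expression does not depend on $X_1^{t-2}$ it must coincide with $P(X_t \mid X_{t-1}, Y_1^n)$. Now we must prove that $P(X_t \mid X_{t-1}, Y_1^n) = P(X_t \mid X_{t-1}, Y_t^n)$.

But

$$\begin{aligned}
P(X_t \mid X_{t-1}, Y_1^n) &= \frac{P(Y_1^n, X_t, X_{t-1})}{P(Y_1^n, X_{t-1})} \\
&= \frac{P(Y_t^n, X_t \mid Y_1^{t-1}, X_{t-1}) \, P(Y_1^{t-1}, X_{t-1})}{P(Y_t^n \mid Y_1^{t-1}, X_{t-1}) \, P(Y_1^{t-1}, X_{t-1})} \\
&= \frac{P(Y_t^n, X_t \mid X_{t-1})}{P(Y_t^n \mid X_{t-1})} \\
&= P(X_t \mid X_{t-1}, Y_t^n)
\end{aligned}$$

This proves i). For ii) use the same technique and Lemma 1.1.2.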


□ 


We  next  use  Lemma  1.1.3  to  find  an  expression  for  the  posterior  probability  of 
the  state  process  Xt  as  a  function  of  the  filter  and  the  one-step  predictor. 

Lemma  1.1.4 

JWim-ij  p{x^  f8)  p(X,+l|Xl) 

Proof: Requires a little manipulation using 1.1.3ii and the easily proved fact that $P(Y_1^t \mid X_t, X_{t+1}) = P(Y_1^t \mid X_t)$.


Again  by  manipulating  the  formulas  and  with  the  help  of  Lemma  1.1.3  we  find 
that: 

$$P(Y_t, X_t \mid Y_1^{t-1}, X_1^{t-1}, X_{t+1}^n, Y_{t+1}^n) = P(Y_t \mid X_t) \, P(X_t \mid X_{t-1}, X_{t+1})$$

which  gives  us  the  structure  of  the  neighborhood  system  for  the  Markov  random 
field  (Xt,Yt). 
1.2  Results  from  Realization  Theory

In [12] Heller characterized the finite valued stationary processes $Y_t$ that are HMC's. We need some preliminaries to present his results. $\mathcal{Y}$ will denote a finite set, $\mathcal{Y}^*$ the set of finite words from $\mathcal{Y}$, and $C^*$ the set of probability distributions on $\mathcal{Y}^*$. $C^*$ is convex. A convex subset $C \subset C^*$ is polyhedral if $C = \mathrm{conv}\{q_1(\cdot), \dots, q_c(\cdot)\}$, i.e. $C$ is generated by finitely many distributions $q_i(\cdot) \in C^*$. A convex polyhedral subset $C \subset C^*$ is stable if $C = \mathrm{conv}\{q_1(\cdot), \dots, q_c(\cdot)\}$ and for $1 \le i \le c$ and $\forall y \in \mathcal{Y}$ the conditional distributions $q_i(\cdot \mid y) := q_i(y \, \cdot)/q_i(y)$ belong to $C$. We are now ready to enunciate the main result.

Theorem 1.2.1 (Heller [12])

$P_Y(\cdot)$ is the pdf of a HMC iff the set

$$C_Y := \mathrm{conv}\{P_Y(\cdot \mid u), \ u \in \mathcal{Y}^*\}$$

is contained in a polyhedral stable subset of $C^*$.

□

For an elementary and insightful proof of Heller's theorem, see Picci [23], which we followed for the presentation of the result. Suppose that a given $P_Y(\cdot)$ satisfies the conditions of Heller's theorem and that the $Y_t$ process is therefore a HMC. Two questions now arise naturally. We called any SFSS $(X_t, \tilde Y_t)$ with $\tilde Y_t$ equivalent to $Y_t$ a representation of the HMC $Y_t$ and showed that the distributions of a SFSS are completely specified by the set of parameters $M := \{c, M(y), \pi\}$ where $c = |\mathcal{X}|$. It is therefore natural to identify a representation of the HMC $Y_t$ with the set $M$. When clear from the context we will omit $c$ from the list of parameters.

The first question is: can the parameters of a representation be determined directly from $P_Y(\cdot)$?

Such a representation of $Y_t$ is inherently non-unique and we would like to find the "simplest" one. Take $|\mathcal{X}|$ as a measure of complexity, and for a given HMC $Y_t$ define its order as the minimum of $|\mathcal{X}|$ among all representations. A representation for which $|\mathcal{X}|$ equals the order is said to be a minimal representation.

The second question is: can the order be determined?
At present the answer to both questions is in the negative, but there are a few clues. Unless otherwise noted the following is derived from the works of Gilbert [10], Carlyle [6] and Paz [21]. Let $p(\cdot)$ be an arbitrary pdf (not necessarily HMC), and $v_1 \cdots v_n$, $v'_1 \cdots v'_n$ be $2n$ arbitrary words from $\mathcal{Y}^*$. The compound sequence matrix (c.s.m.) $P(v_1 \cdots v_n, v'_1 \cdots v'_n)$ is the $n \times n$ matrix with $ij$ element $p(v_i v'_j)$. The rank of $p(\cdot)$ is defined as the maximum of the ranks of all possible c.s.m. if such maximum exists or $+\infty$ otherwise. Suppose now that $p(\cdot)$ is the pdf of a HMC which admits a representation $M := \{c, M(y), \pi\}$ of size $c$. Using the definitions following Lemma 1.1.1 we have that: $P(v_1 \cdots v_n, v'_1 \cdots v'_n) = G(v_1 \cdots v_n) H(v'_1 \cdots v'_n)$ where $G$, $H$ are $n \times c$ and $c \times n$ matrices respectively. The $i$-th row of $G$ is $g(v_i)$ and the $j$-th column of $H$ is $h(v'_j)$; in fact: $p(v_i v'_j) = \pi M(v_i v'_j) e = \pi M(v_i) M(v'_j) e = g(v_i) h(v'_j)$.

It clearly follows that the rank of a HMC cannot exceed the size of any of its representations and therefore in particular:

Lemma 1.2.1 The rank of a HMC is a lower bound to its order.
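
The rank bound can be observed on a small example. The sketch below assumes a hypothetical two-state, two-symbol representation (the values of `A`, `B`, `pi` are illustrative) and builds the $7 \times 7$ c.s.m. indexed by all words of length at most 2; exact rational arithmetic is used so that the rank is computed without rounding.

```python
from fractions import Fraction as F
from itertools import product

A = [[F(1, 2), F(1, 2)], [F(1, 4), F(3, 4)]]
B = [[F(3, 4), F(1, 4)], [F(1, 3), F(2, 3)]]
pi = [F(1, 3), F(2, 3)]
M = [[[A[i][j] * B[j][y] for j in range(2)] for i in range(2)] for y in range(2)]

def word_prob(word):
    """p(v) = pi M(v) e; the empty word has probability 1."""
    g = pi[:]
    for y in word:
        g = [sum(g[i] * M[y][i][j] for i in range(2)) for j in range(2)]
    return sum(g)

def rank(rows):
    """Rank by exact Gauss-Jordan elimination."""
    rows = [list(r) for r in rows]
    rk = 0
    for col in range(len(rows[0])):
        piv = next((i for i in range(rk, len(rows)) if rows[i][col] != 0), None)
        if piv is None:
            continue
        rows[rk], rows[piv] = rows[piv], rows[rk]
        for i in range(len(rows)):
            if i != rk and rows[i][col] != 0:
                f = rows[i][col] / rows[rk][col]
                rows[i] = [a - f * b for a, b in zip(rows[i], rows[rk])]
        rk += 1
    return rk

# All words v with |v| <= 2, including the null word ().
words = [w for k in range(3) for w in product(range(2), repeat=k)]
csm = [[word_prob(v + w) for w in words] for v in words]   # entries p(v_i v'_j)
csm_rank = rank(csm)
print(len(words), csm_rank)
```

Although the matrix is $7 \times 7$, its rank cannot exceed the size 2 of the representation, in agreement with the factorization $P = GH$.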


Remark: The concept of rank of a pdf is only loosely related to the HMC property because there are examples of pdf's with finite rank that are not HMC. Also there are examples of HMC's whose order is strictly greater than their rank.

A representation $M = \{c, M(\cdot), \pi\}$ of size $c$ is regular if the rank of the corresponding pdf equals $c$. Regular representations are minimal as a direct consequence of Lemma 1.2.1. As explained in the remark not all HMC's admit regular representations, but the following two results will justify our interest in them. The first result states that it is "easy" to check regularity.


Lemma 1.2.2 A finite number of operations is sufficient to determine the regularity of a given representation $M = \{c, M(\cdot), \pi\}$.


Proof: Let $c$ be the size of $M$. We must check that there exist $2c$ words $v_1 \cdots v_c$, $v'_1 \cdots v'_c$ such that the c.s.m. of size $c$: $G(v_1 \cdots v_c) H(v'_1 \cdots v'_c)$ is invertible. This is equivalent to checking the invertibility of both $G$ and $H$. To complete the proof it is sufficient to show that rank $G(v_1 \cdots v_c)$ attains its maximum for at least one set of words $(v_1 \cdots v_c)$ with $|v_i| < c$ ($1 \le i \le c$) (and similarly for $H(v'_1 \cdots v'_c)$).

Let $L_k := \mathrm{span}\{g(v), \ v \in \mathcal{Y}^*, |v| = k\}$; $L_k$ is a linear subspace of the vector space $\mathbb{R}^c$. We will show that for $k \ge c$ all subspaces $L_k$ are identical. Since $g(v) = \sum_y g(yv)$ (by stationarity, $\pi A = \pi$) we have that $L_k \subset L_{k+1}$, and since $g(vy) = g(v) M(y)$ we have that $L_k = L_{k+1} \Rightarrow L_{k+m} = L_k \ \forall m \ge 1$. Let $J$ be the first integer for which $L_J = L_{J+1}$; then $\dim L_0 + J \le \dim L_J \le c$ and we conclude that $J < c$.

The proof for $H$ is analogous.
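
The proof suggests a finite procedure: collect $g(v)$ and $h(v)$ for all words of length less than $c$ and compute two ranks. The sketch below is one possible rendering of that check, not an algorithm given in the thesis; the function names and the two numerical models are assumptions made for illustration.

```python
from fractions import Fraction as F
from itertools import product

def rank(rows):
    """Rank by exact Gauss-Jordan elimination."""
    rows = [list(r) for r in rows]
    rk = 0
    for col in range(len(rows[0])):
        piv = next((i for i in range(rk, len(rows)) if rows[i][col] != 0), None)
        if piv is None:
            continue
        rows[rk], rows[piv] = rows[piv], rows[rk]
        for i in range(len(rows)):
            if i != rk and rows[i][col] != 0:
                f = rows[i][col] / rows[rk][col]
                rows[i] = [a - f * b for a, b in zip(rows[i], rows[rk])]
        rk += 1
    return rk

def make_M(A, B):
    """M(y) = A B_y under the factorization hypothesis."""
    c, r = len(A), len(B[0])
    return [[[A[i][j] * B[j][y] for j in range(c)] for i in range(c)]
            for y in range(r)]

def is_regular(pi, M):
    """Check rank G = rank H = c using only words of length < c."""
    c, r = len(pi), len(M)
    words = [w for k in range(c) for w in product(range(r), repeat=k)]
    G, H = [], []
    for w in words:
        g, h = pi[:], [F(1)] * c
        for y in w:                      # g(v) = pi M(y_1) ... M(y_k)
            g = [sum(g[i] * M[y][i][j] for i in range(c)) for j in range(c)]
        for y in reversed(w):            # h(v) = M(y_1) ... M(y_k) e
            h = [sum(M[y][i][j] * h[j] for j in range(c)) for i in range(c)]
        G.append(g)
        H.append(h)
    return rank(G) == c and rank(H) == c

A = [[F(1, 2), F(1, 2)], [F(1, 4), F(3, 4)]]
pi = [F(1, 3), F(2, 3)]
regular = is_regular(pi, make_M(A, [[F(3, 4), F(1, 4)], [F(1, 3), F(2, 3)]]))
# If the output law does not depend on the state, every g(v) is a multiple of pi.
degenerate = is_regular(pi, make_M(A, [[F(1, 2), F(1, 2)], [F(1, 2), F(1, 2)]]))
print(regular, degenerate)
```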


The second result states that almost all representations are regular. Let $\Gamma$ be the set of all $M := \{c, M(y), \pi\}$ of size $c$. $\Gamma$ is a compact set in $\mathbb{R}^k$ for some $k$ depending on $c$.

Lemma 1.2.3 The non-regular elements of $\Gamma$ are a closed subset of $\mathbb{R}^k$-Lebesgue measure zero.

Proof: The non-regular points of $\Gamma$ are characterized by the vanishing of the determinant of a finite number of matrices.


We  conclude  this  section  with  an  observation  on  the  structure  of  the  pdf  of 
HMC’s. 


Lemma 1.2.4 If $Y_t$ is a HMC known to admit a representation of size $c$ then its pdf $p(\cdot)$ is completely determined by the values $\{p(v), |v| \le 2c\}$.


Proof: The existence of a representation of size $c$ implies that the rank of $p(\cdot)$ cannot exceed $c$. Let $r$ be the rank of $p(\cdot)$ and $P$ an $r \times r$ c.s.m. of rank $r$. Such a $P$ exists, moreover the proof of Lemma 1.2.2 shows that the words $v_1 \cdots v_r$, $v'_1 \cdots v'_r$ defining $P$ can be chosen of length $< c$. Let $w, w'$ be arbitrary words; since $P$ is invertible we have that $[p(w v'_1) \cdots p(w v'_r)] = a(w) P$ for some row vector $a(w)$ which only depends on $w$ and $P$. Construct the c.s.m. $\hat P := P(v_1 \cdots v_r, w; \ v'_1 \cdots v'_r, w')$. Since $P$ is of maximal rank, the rank of $\hat P$ must also be $r$ and therefore:

$$p(w w') = [p(w v'_1) \cdots p(w v'_r)] \, P^{-1} \, [p(v_1 w') \cdots p(v_r w')]^T = a(w) \, [p(v_1 w') \cdots p(v_r w')]^T$$


We gave the result in this form since it can easily be proved from what we already presented. The best possible bound on the length of the words determining $p(\cdot)$ completely is $2c - 1$ (see Paz [21]). In Carlyle [6] a recursive algorithm is given for the computation of $p(\cdot)$ of long strings starting from $\{p(v), |v| \le 2c - 1\}$.

Remark: It is always possible to construct a c.s.m. $P$ of maximum rank taking $v_1 = v'_1 = \phi$; this can be seen taking $w = \phi$ in the proof of Lemma 1.2.4. Expanding the determinant of $\hat P$ along the last row and along the last column and comparing we get the result.


1.3  Equivalent  representations 

In  this  section  we  study  conditions  for  the  equivalence  of  representations.  Our 
results  are  not  the  best  possible  (see  [13])  but  they  are  relatively  straightforward. 
The  following  is  a  sufficient  condition  for  equivalence. 


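Lemma 1.3.1 Let $M := \{c, M(\cdot), \pi\}$ and $\hat M := \{\hat c, \hat M(\cdot), \hat\pi\}$. If $X$, $Y$ are $c \times \hat c$ and $\hat c \times c$ matrices respectively such that:

$$\begin{aligned}
\hat M(y) &= Y M(y) X \qquad \forall y \in \mathcal{Y} \\
\hat\pi &= \pi X \\
\hat e &= Y e \\
X Y &= I_c
\end{aligned}$$

then $M$ and $\hat M$ are equivalent.


Proof: It is sufficient to verify that for an arbitrary word $w$: $\hat\pi \hat M(w) \hat e = \pi M(w) e$. This follows immediately by substitution.

□

Remark: The condition $XY = I_c$ implies rank $X \ge c$ and rank $Y \ge c$, therefore the lemma is non-trivial only for $\hat c \ge c$.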
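
A small numerical check of Lemma 1.3.1 with $\hat c = c + 1$ is sketched below. The matrices `X` and `Y` split the second state of a hypothetical two-state model into two copies; this particular choice, like the numerical values of `A`, `B` and `pi`, is an assumption made only for illustration.

```python
from fractions import Fraction as F
from itertools import product

def matmul(P, Q):
    """Product of two matrices given as lists of rows."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*Q)] for row in P]

def word_prob(pi, M, word):
    """pi M(y_1) ... M(y_n) e, where e is the all-ones vector."""
    g = [pi]
    for y in word:
        g = matmul(g, M[y])
    return sum(g[0])

A = [[F(1, 2), F(1, 2)], [F(1, 4), F(3, 4)]]
B = [[F(3, 4), F(1, 4)], [F(1, 3), F(2, 3)]]
pi = [F(1, 3), F(2, 3)]
M = [[[A[i][j] * B[j][y] for j in range(2)] for i in range(2)] for y in range(2)]

X = [[F(1), F(0), F(0)], [F(0), F(1, 2), F(1, 2)]]   # c x c_hat
Y = [[F(1), F(0)], [F(0), F(1)], [F(0), F(1)]]       # c_hat x c

M_hat = [matmul(matmul(Y, My), X) for My in M]       # M_hat(y) = Y M(y) X
pi_hat = matmul([pi], X)[0]                          # pi_hat = pi X
e_hat = [row[0] for row in matmul(Y, [[F(1)], [F(1)]])]   # e_hat = Y e

same = all(word_prob(pi, M, w) == word_prob(pi_hat, M_hat, w)
           for k in range(1, 5) for w in product(range(2), repeat=k))
print(same)
```

Here `e_hat` comes out as the all-ones vector of size 3, so `M_hat` with `pi_hat` is itself a representation of size 3 of the same HMC.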

If  one  of  the  representations  is  regular  we  can  give  a  necessary  condition  for 
equivalence  as  follows. 

Lemma 1.3.2 Let $M := \{c, M(\cdot), \pi\}$ and $\hat M := \{\hat c, \hat M(\cdot), \hat\pi\}$ and assume $M$ to be regular. If $\hat M$ is equivalent to $M$ then $\hat c \ge c$ and there exist $X$, $Y$ of dimensions $c \times \hat c$, $\hat c \times c$ respectively such that:

$$\begin{aligned}
M(y) &= X \hat M(y) Y \\
\pi &= \hat\pi Y \\
e &= X \hat e \\
X Y &= I_c
\end{aligned}$$


Proof: The necessity of $\hat c \ge c$ follows from the minimality of regular representations. We will exhibit a pair of matrices $X$, $Y$ satisfying the conditions.

Since $M$ is regular there exists an invertible c.s.m. $P(v_1 \cdots v_c, v'_1 \cdots v'_c)$. By the last remark of Section 1.2 we can always select $v_1 = v'_1 = \phi$. Therefore $P(\phi, v_2 \cdots v_c, \ \phi, v'_2 \cdots v'_c) = G(\phi, v_2 \cdots v_c) H(\phi, v'_2 \cdots v'_c)$ where both $G$ and $H$ are invertible. Observe that the first row of $G$ is $\pi$ and the first column of $H$ is $e$. Since $M$ and $\hat M$ are equivalent they have the same c.s.m. In particular this means that:

a) $\hat G(\phi, v_2 \cdots v_c) \hat H(\phi, v'_2 \cdots v'_c) = G(\phi, v_2 \cdots v_c) H(\phi, v'_2 \cdots v'_c)$

b) $\hat G(\phi, v_2 \cdots v_c) \hat M(y) \hat H(\phi, v'_2 \cdots v'_c) = G(\phi, v_2 \cdots v_c) M(y) H(\phi, v'_2 \cdots v'_c) \quad (\forall y \in \mathcal{Y})$

Observe that $\hat G$ and $\hat H$ are $c \times \hat c$ and $\hat c \times c$ respectively and each of rank $c$ since their product is of rank $c$.
Define $X := G^{-1} \hat G$ and $Y := \hat H H^{-1}$. Then by b) $M(y) = X \hat M(y) Y$. To verify that $\pi = \hat\pi Y$ observe that $\hat\pi$ is the first row of $\hat G$, therefore $\hat\pi Y$ = (first row of $\hat G$) $\hat H H^{-1}$ = first row of $(\hat G \hat H) H^{-1}$ = first row of $(G H) H^{-1}$ = $\pi$. Analogously it can be proved that $e = X \hat e$. Finally $XY = G^{-1} \hat G \hat H H^{-1} = G^{-1} G H H^{-1} = I_c$.

□


1.4  Families  of  HMC’s 

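In this section we introduce the families of HMC's that will be used as model classes in the following chapters.

From now on $\mathcal{Y}$ will be a fixed finite set with $|\mathcal{Y}| = r$. The family $\Theta$ of all HMC's of all orders (taking values in $\mathcal{Y}$) can be identified with the family of all $\theta := \{c_\theta, M_\theta(y), \pi_\theta\}$ with $c_\theta \in \mathbb{N}$. For $\theta \in \Theta$ define $P_\theta(y_1^n) := \pi_\theta M_\theta(y_1^n) e_{c_\theta}$ (we will often drop the subscripts in the RHS and simply write $P_\theta(y_1^n) = \pi M(y_1^n) e$). Define $\Theta_q := \{\theta \in \Theta; \ c_\theta = q\}$.

Lemma 1.4.1

$\forall q, \ \forall \theta \in \Theta_q \ \exists \hat\theta \in \Theta_{q+1}$ such that $P_{\hat\theta}(\cdot) = P_\theta(\cdot)$ or, abusing the notation, $\Theta_q \subset \Theta_{q+1}$.

Proof: Let $\theta = \{q, M(y), \pi\}$ and construct $\hat\theta$ as follows:

$\hat q = q + 1$, and $\hat M(y) := \mathrm{diag}\{M(y), \hat m(y)\}$,

where $\hat m(y) \in [0, 1]$ is chosen so that $\hat A := \sum_y \hat M(y)$ is a stochastic matrix, and $\hat\pi = (\pi, 0)$. It is immediate to verify that $P_{\hat\theta}(\cdot) = P_\theta(\cdot)$.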
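
The construction in the proof can be checked numerically. In the sketch below the two-state model and the values $\hat m(0) = \hat m(1) = 1/2$ are assumptions chosen only for illustration; the added state carries zero initial mass and is never reached, so the output law does not change.

```python
from fractions import Fraction as F
from itertools import product

def word_prob(pi, M, word):
    """pi M(y_1) ... M(y_n) e for a representation of any size."""
    g = pi[:]
    for y in word:
        g = [sum(g[i] * M[y][i][j] for i in range(len(g))) for j in range(len(g))]
    return sum(g)

def extend(pi, M, m_hat):
    """Return (pi_hat, M_hat) with M_hat(y) = diag{M(y), m_hat(y)}, pi_hat = (pi, 0)."""
    q = len(pi)
    M_ext = []
    for My, my in zip(M, m_hat):
        rows = [row + [F(0)] for row in My]    # old states never enter the new one
        rows.append([F(0)] * q + [my])         # the new state only loops on itself
        M_ext.append(rows)
    return pi + [F(0)], M_ext

A = [[F(1, 2), F(1, 2)], [F(1, 4), F(3, 4)]]
B = [[F(3, 4), F(1, 4)], [F(1, 3), F(2, 3)]]
pi = [F(1, 3), F(2, 3)]
M = [[[A[i][j] * B[j][y] for j in range(2)] for i in range(2)] for y in range(2)]

pi_hat, M_hat = extend(pi, M, [F(1, 2), F(1, 2)])
row_sums = [sum(sum(My[i]) for My in M_hat) for i in range(3)]   # rows of A_hat
same = all(word_prob(pi, M, w) == word_prob(pi_hat, M_hat, w)
           for k in range(1, 5) for w in product(range(2), repeat=k))
print(row_sums, same)
```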


Statisticians refer to families satisfying Lemma 1.4.1 as nested families. A few considerations about the identifiability of $\Theta$ are now in order. A point $\theta \in \Theta_q$ is identifiable in $\Theta_q$ if for any $\theta' \ne \theta$ ($\theta' \in \Theta_q$) $P_{\theta'}(\cdot) \ne P_\theta(\cdot)$, i.e. for at least one word $w$, $P_{\theta'}(w) \ne P_\theta(w)$. This definition is too strong and it would give no identifiable points in any $\Theta_q$. In fact for a given $\theta$ at least the (finitely many) points $\theta'$ obtained by permutations of the rows and columns of $M(y)$ and $\pi$ give $P_{\theta'}(\cdot) = P_\theta(\cdot)$. We will say that $\theta \in \Theta_q$ is identifiable modulo permutations (i.m.p.) if the only points $\theta' \in \Theta_q$ with $P_{\theta'}(\cdot) = P_\theta(\cdot)$ are obtained by permutation as described above. Regular points $\theta \in \Theta_q$ (i.e. points for which rank $P_\theta = q$) are good candidates for being i.m.p. but a few (mild) extra conditions must be added. In [22] Petrie proves a theorem on identifiability that we will adapt to our case.
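
The reason identifiability can hold only modulo permutations is easy to see on an example. The sketch below relabels the two states of a hypothetical model (numerical values are assumptions) and confirms that the parameters change while the output law does not.

```python
from fractions import Fraction as F
from itertools import product

def word_prob(pi, M, word):
    """pi M(y_1) ... M(y_n) e."""
    g = pi[:]
    for y in word:
        g = [sum(g[i] * M[y][i][j] for i in range(len(g))) for j in range(len(g))]
    return sum(g)

A = [[F(1, 2), F(1, 2)], [F(1, 4), F(3, 4)]]
B = [[F(3, 4), F(1, 4)], [F(1, 3), F(2, 3)]]
pi = [F(1, 3), F(2, 3)]
M = [[[A[i][j] * B[j][y] for j in range(2)] for i in range(2)] for y in range(2)]

perm = [1, 0]                                   # swap the two state labels
pi_p = [pi[perm[i]] for i in range(2)]
M_p = [[[My[perm[i]][perm[j]] for j in range(2)] for i in range(2)] for My in M]

same_law = all(word_prob(pi, M, w) == word_prob(pi_p, M_p, w)
               for k in range(1, 5) for w in product(range(2), repeat=k))
different_point = (pi_p, M_p) != (pi, M)
print(same_law, different_point)
```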


Definition  1.4.1 

$\theta = \{q, M(y), \pi\}$ is a Petrie point if

$\theta$ is regular

$M(y)$ is invertible $\forall y$

$\exists y_* \in \mathcal{Y}$ such that $b_{i y_*}$ ($i = 1, 2, \dots, q$) are distinct.


Theorem  1.4.1  (Petrie  [22]  adapted) 

The Petrie points of $\Theta_q$ are identifiable modulo permutation.


Proof: Let $\theta \in \Theta_q$ be a Petrie point. We show that if $\hat\theta \in \Theta_q$ and $P_{\hat\theta}(\cdot) = P_\theta(\cdot)$ then $\hat\theta$ and $\theta$ differ by a permutation. By regularity there exist $v_1 \cdots v_q$, $v'_1 \cdots v'_q$ such that $P(v_1 \cdots v_q, v'_1 \cdots v'_q) = G(v_1 \cdots v_q) H(v'_1 \cdots v'_q)$ is invertible together with $G(v_1 \cdots v_q)$ and $H(v'_1 \cdots v'_q)$. Since $P_{\hat\theta}(\cdot) = P_\theta(\cdot)$ their c.s.m.'s are identical and therefore:

(1) $\hat G(v_1 \cdots v_q) \hat H(v'_1 \cdots v'_q) = G(v_1 \cdots v_q) H(v'_1 \cdots v'_q)$.

We conclude that $\hat G$ and $\hat H$ are also invertible. (For convenience we dropped the arguments.) Analogously $\forall y \in \mathcal{Y}$, $\hat G \hat M(y) \hat H = G M(y) H$ (since these too are c.s.m.'s), from which we have: $M(y) = G^{-1} \hat G \hat M(y) \hat H H^{-1}$. From (1) $\hat H H^{-1} = \hat G^{-1} G$. Let $X := \hat G^{-1} G$ (invertible) then:

(2) $M(y) = X^{-1} \hat M(y) X$.

To conclude it is enough to prove that $X$ is a permutation matrix. Toward this end sum (2) over $y$ and get $A = X^{-1} \hat A X$; substituting into (2) we have $A B_y = X^{-1} \hat A X B_y = X^{-1} \hat A \hat B_y X$. Since $M(y) = A B_y$ is invertible ($\theta$ is a Petrie point) so must be $A$, and hence $\hat A$. We finally have: $\hat B_y X = X B_y$. Let $y_* \in \mathcal{Y}$ be such that $b_{i y_*}$, $i = 1, \dots, q$, are distinct. From $\hat B_{y_*} X = X B_{y_*}$ we see that the $j$-th column $x_j$ of $X$ satisfies: $\hat B_{y_*} x_j = b_{j y_*} x_j$. Since $\hat B_{y_*}$ is diagonal and the $b_{j y_*}$ are distinct, $x_j$ can only be a multiple of one of the standard basis vectors $e_1, e_2, \dots, e_q$. This means that there is at most one nonzero entry in each column of $X$. Since $X$ is invertible and $X e = e$, every nonzero entry equals 1, which concludes the proof.


Lemma  1.4.2 

The set of Petrie points is open and of full Lebesgue measure in $\Theta_q$.

Proof:  We  already  proved  this  for  regular  points  in  Lemma  1.2.3.  The  added 
hypotheses  can  be  dealt  with  in  the  same  way. 

□ 

It will often be convenient to somewhat restrict the family $\Theta_q$ in order to simplify
statistical considerations.

Definition  1.4.2 
For  0  <  §  <  )  define: 

$\Theta_q^\delta := \{\theta \in \Theta_q;\ a_{ij} > \delta,\ b_{jy} > \delta,\ \forall\, i, j, y\}$.

□ 

Remark: If $\theta \in \Theta_q^\delta$ the stochastic matrix $A$ is certainly irreducible and aperiodic
and its invariant vector is uniquely determined. In this case $\theta$ is completely
specified by $\{M(y)\}$ or by $\{A, B\}$.

The  following  lemma  is  simple  but  it  will  be  essential  later.  With  the  abuse  of 
notation  introduced  in  Lemma  1.4.1  we  have: 

Lemma  1.4.3 

$\Theta_q^\delta \subset \Theta_{q+1}^{\delta/2}$

Proof: Let $\theta \in \Theta_q^\delta$. Define $X$, $Y$ matrices $q \times (q+1)$ and $(q+1) \times q$ respectively
as follows:


and $\tilde M(y) = Y M(y) X$ $\forall y \in \mathcal{Y}$.
It is easily checked that the conditions for the validity of Lemma 1.3.1 are
satisfied and therefore $\tilde\theta = \{\tilde M(y)\}$ is equivalent to $\theta$. The definitions of $X$ and $Y$
also guarantee that $\tilde\theta \in \Theta_{q+1}^{\delta/2}$.

□
Chapter 2

HMC’s  as  Models  of  Stationary 
Processes 


The consistency of the Maximum Likelihood Estimator (MLE) for HMC's was
established in [4] under the assumption that the true distribution of the observations
comes from a HMC. Here we show that, if $Y_t$ is stationary and ergodic, the MLE
taken on a class of HMC's converges to the model closest to the true distribution in
the divergence sense. The result in [4] is therefore a special case of ours. In Section
2.1 we briefly review the misspecified model approach that we follow here. In
Section 2.2 we present our version of the consistency of the MLE. In Section 2.3
we prove a slightly generalized version of the Shannon-McMillan-Breiman theorem
which is related to the consistency results of the previous section. The final Section
2.4 settles a technical problem.

2.1  The misspecified model approach to parameter estimation

Suppose a given series of observations $\{y_1, y_2, \cdots, y_n\}$ is to be modeled for some
specific reason. For example we might want to predict $y_{n+1}$ or compress $\{y_1 \cdots y_n\}$
for storage. Confronted with this problem a statistician would most likely set
up a related parameter estimation problem as follows. He or she would first
assume that the sample is generated by some unknown stochastic mechanism, let
us say $y_k = g_k(\omega)$, $1 \le k \le n$. The observed data sample is now interpreted as
the initial segment of a realization of an unknown stochastic process. Based on
prior information, insight, and mathematical tractability, a class of models would
then be selected. The models in the class will be denoted $\{f_k(\cdot, \theta),\ \theta \in \Theta\}$ where
$\{f_k(\cdot, \theta)\}_{k \ge 1}$ is a stochastic process whose probability law is completely specified
by the parameter $\theta$. The modeling problem is now reduced to an estimation
problem. According to some specified criterion of optimality the statistician selects a
model, i.e. estimates the $\theta$ that best fits the data. Let us call the estimator based
on $n$ observations $\hat\theta_n$.

How are we to judge the quality of $\hat\theta_n$? Ideally we should compare $f_k(\cdot, \hat\theta_n)$ to
$g_k(\cdot)$ but the latter is unknown. There are two possible solutions. The classical
one is to assume that the unknown process $g_k$ is actually a member of the selected
class, i.e. $g_k(\cdot) = f_k(\cdot, \theta_0)$ for some true (but unknown) $\theta_0$. The estimator $\hat\theta_n$ is
then judged to be good if it behaves well, uniformly with respect to $\theta_0 \in \Theta$. Based
on this idea a great deal of statistical theory has been developed on the asymptotic
properties of various estimators.

The second approach (which we prefer) does not rely on the existence of a
true parameter $\theta_0$ in $\Theta$. After all, the class of models was chosen more or less
arbitrarily; why should $g_k$ belong to it? The problem is transformed into one of
best approximation. A distance $d(\cdot, \cdot)$ between probability measures is introduced
and $\theta^*$ is defined by $d(P_g, P_{\theta^*}) = \min_{\theta \in \Theta} d(P_g, P_\theta)$. The estimator $\hat\theta_n$ is judged to be
good if it is close to $\theta^*$. In the statistical literature this is known as the misspecified
model approach. We learned about misspecified models from Nishii [19], which
treats the iid case.

2.2  HMC’s  as  misspecified  models  of  stationary 
processes 

In this section we introduce our first statistical result involving HMC's. We observe
the process $Y_t$ with values in the finite set $\mathcal{Y}$. The only assumptions on $Y_t$ are
stationarity and ergodicity. Denote by $Q$ the probability distribution induced by $Y_t$
on the sequence space over $\mathcal{Y}$. The class of models for $Y_t$ will be $\Phi := \Theta_q^\delta$ with $q$ and $\delta$ fixed (see
Section 1.4 for the notations). Notice that we do not assume a priori that $Q = P_{\theta_0}$
for some $\theta_0 \in \Phi$. Instead we are adopting the misspecified model approach described
in the previous section.

Our goal is to prove the analog of the consistency of the maximum likelihood
estimator in this setup. Toward this end define:

\[ h_n(\theta, Y) := \frac{1}{n} \log P_\theta(Y_1^n) \tag{2.1} \]

Following the terminology from [19] we define the quasi-maximum likelihood
estimator (q.m.l.e.) $\hat\theta(n)$ as:

\[ \hat\theta(n) := \{\bar\theta \in \Phi;\ h_n(\bar\theta, Y) = \sup_{\theta \in \Phi} h_n(\theta, Y)\} \tag{2.2} \]


Remark: $\hat\theta(n)$ is defined as a set because uniqueness is not guaranteed for this class
of models. It is easy to see that in (2.2) the sup can be replaced by a max.

We need a notion of "distance" between $Q$ and the $P_\theta$'s. A reasonable choice,
justified by its widespread use in statistics and engineering, would be the divergence
rate:

\[ D(Q \,\|\, P_\theta) := \lim_{n \to \infty} \frac{1}{n} E_Q\!\left[ \log \frac{Q(Y_1^n)}{P_\theta(Y_1^n)} \right] \tag{2.3} \]

Clearly we must prove the existence of the limit for this distance to be well defined.
Referring to the proof given later, we state here that (2.3) is indeed a legitimate
definition.

It will also be proved later that:

\[ D(Q \,\|\, P_\theta) = H_Q - H_Q(\theta) \ge 0 \tag{2.4} \]

where $H_Q := E_Q[\log Q(Y_0 \mid Y_{-\infty}^{-1})]$ is minus the entropy of $Y_t$ under $Q$, and $H_Q(\theta) :=
E_Q[\log P_\theta(Y_0 \mid Y_{-\infty}^{-1})]$ is a well-defined and continuous function of $\theta \in \Phi$.
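
As a hedged illustration of the decomposition (2.4), consider the special case in which both $Q$ and $P_\theta$ are first-order Markov chains with strictly positive transition matrices (an assumption made only for this example; it is not the general setting of this section). The conditional probabilities then depend only on the previous symbol, so $H_Q = \sum_{i,j} \pi_i q_{ij} \log q_{ij}$ and $H_Q(\theta) = \sum_{i,j} \pi_i q_{ij} \log p_{ij}$, where $\pi$ is the invariant vector of $Q$. The function names and matrices below are hypothetical.

```python
import math


def stationary(Qm, iters=2000):
    """Approximate the invariant vector of a strictly positive stochastic matrix by power iteration."""
    q = len(Qm)
    pi = [1.0 / q] * q
    for _ in range(iters):
        pi = [sum(pi[i] * Qm[i][j] for i in range(q)) for j in range(q)]
    return pi


def divergence_rate(Qm, Pm):
    """D(Q || P) = H_Q - H_Q(theta) for two first-order Markov chains."""
    q = len(Qm)
    pi = stationary(Qm)
    h_q = sum(pi[i] * Qm[i][j] * math.log(Qm[i][j]) for i in range(q) for j in range(q))
    h_q_theta = sum(pi[i] * Qm[i][j] * math.log(Pm[i][j]) for i in range(q) for j in range(q))
    return h_q - h_q_theta


Q_true = [[0.8, 0.2], [0.3, 0.7]]   # illustrative "true" chain
P_model = [[0.6, 0.4], [0.5, 0.5]]  # illustrative misspecified model

print(divergence_rate(Q_true, P_model))  # strictly positive
print(divergence_rate(Q_true, Q_true))   # zero: the model matches the truth
```

In this restricted case the nonnegativity in (2.4) reduces to the nonnegativity of a weighted sum of Kullback-Leibler divergences between rows of the two transition matrices.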

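Next define the quasi-true parameter set as:

\[ \mathcal{N} := \{\bar\theta \in \Phi;\ D(Q \,\|\, P_{\bar\theta}) = \min_{\theta \in \Phi} D(Q \,\|\, P_\theta)\} \]

A direct consequence of (2.4) is that:

\[ \mathcal{N} = \{\bar\theta \in \Phi;\ H_Q(\bar\theta) = \max_{\theta \in \Phi} H_Q(\theta)\} \tag{2.5} \]

For the proof of Theorem 2.2.1 we need the following result, proved in Section 2.4.

\[ h_n(\theta, Y) \to H_Q(\theta) \quad \text{a.s. } Q, \text{ uniformly in } \theta. \tag{2.6} \]

Remark: We recall the notion of a.s. set convergence that will be used. For
any subset $E \subset \Phi$ define the $\varepsilon$-fattened set $E_\varepsilon := \{\theta \in \Phi;\ \rho(\theta, E) < \varepsilon\}$, where $\rho$
is the euclidean distance. Then $\hat\theta(n) \to \mathcal{N}$ a.s. $Q$ if $\forall \varepsilon > 0$ $\exists\, N(\varepsilon, \omega)$ such that
$\forall n \ge N(\varepsilon, \omega)$, $\hat\theta(n) \subset \mathcal{N}_\varepsilon$.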

We  are  now  ready  to  state  our  result: 

Theorem  2.2.1 

$\hat\theta(n) \to \mathcal{N}$ a.s. $Q$

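Proof: Recall that $\Phi := \Theta_q^\delta$ is compact. Fix $\varepsilon > 0$. $\mathcal{N}_\varepsilon$ being open, the
complement $\mathcal{N}_\varepsilon^c$ is compact. Cover $\mathcal{N}_\varepsilon^c$ with euclidean open balls $B(\theta, \lambda_\theta)$ centered
at $\theta$ and of radius $\lambda_\theta$. The radii $\lambda_\theta$ can be chosen so that $\forall \theta$, $B(\theta, \lambda_\theta) \subset \mathcal{N}_{\varepsilon/2}^c$
strictly. Let $\bar B(\theta, \lambda_\theta)$ be the closure of $B(\theta, \lambda_\theta)$. The following chain of inclusions
is easily verified:

\[ \mathcal{N}_\varepsilon^c \subset \bigcup_{\theta \in \mathcal{N}_\varepsilon^c} B(\theta, \lambda_\theta) \subset \bigcup_{\theta \in \mathcal{N}_\varepsilon^c} \bar B(\theta, \lambda_\theta) \subset \mathcal{N}_{\varepsilon/2}^c \]

By the compactness of $\mathcal{N}_\varepsilon^c$ there exists a finite subcovering:

\[ \mathcal{N}_\varepsilon^c \subset \bigcup_{i=1}^{I} B_i \subset \bigcup_{i=1}^{I} \bar B_i \subset \mathcal{N}_{\varepsilon/2}^c \qquad (\text{where } B_i := B(\theta_i, \lambda_{\theta_i})) \]

Let $H_Q^* := H_Q(\theta)\,|_{\theta \in \mathcal{N}}$ (i.e. the maximum value attained by $H_Q(\cdot)$). By the uniform
convergence of $h_n(\theta, Y)$:

\[ \max_{\theta \in \bar B_i} h_n(\theta, Y) \to \max_{\theta \in \bar B_i} H_Q(\theta) = H_Q^* - \alpha_i \quad \text{a.s. } Q \]

for some $\alpha_i > 0$. Therefore for $n$ large enough:

\[ \max_{\theta \in \bar B_i} h_n(\theta, Y) < H_Q^* - \frac{\alpha_i}{2} \]

Piecing together the $I$ balls $\bar B_i$ and letting $\alpha := \min_i \alpha_i$ we have:

\[ \max_{\theta \in \mathcal{N}_\varepsilon^c} h_n(\theta, Y) < H_Q^* - \frac{\alpha}{2} \tag{2.7} \]

On the other hand the uniform convergence of $h_n$ also implies that:

\[ \sup_{\theta \in \Phi} h_n(\theta, Y) \to \sup_{\theta \in \Phi} H_Q(\theta) = H_Q^* \quad \text{a.s. } Q \]

and therefore for $n$ large enough:

\[ \sup_{\theta \in \Phi} h_n(\theta, Y) > H_Q^* - \frac{\alpha}{2} \tag{2.8} \]

Comparing (2.7) and (2.8) we conclude that $\hat\theta(n) \subset \mathcal{N}_\varepsilon$.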


This proof is even simpler than the one given by Baum and Petrie [4] for the case
of perfect modeling (i.e. $Q = P_{\theta_0}$ for some $\theta_0 \in \Phi$) because it uses the uniform
convergence of $h_n(\theta, Y)$.

2.3  A generalization of the Shannon-McMillan-Breiman theorem

In this section we present a slightly generalized version of the Shannon-McMillan-
Breiman (SMB) theorem and prove, en passant, (2.3) and (2.4). The SMB theorem,
first introduced by Shannon in 1948, already has a rich history of extensions and
generalizations, vestiges of which are found in its very name. The classic version of
the theorem is the following:

Theorem  2.3.1 

Let $Y_t$ be a finitely valued stationary ergodic process with probability distribution
$Q(\cdot)$. Then:

\[ \frac{1}{n} \log Q(Y_1^n) \to E_Q[\log Q(Y_0 \mid Y_{-\infty}^{-1})] \quad \text{a.e. and in } L^1 \]


In  this  form  the  theorem  has  direct  application  in  Information  Theory  because  it 
allows  the  estimation  of  the  entropy  rate  of  a  finite  alphabet  stationary  ergodic 
source.  Generalizations  of  theorem  2.3.1  have  appeared  for  the  case  of  real  valued 
processes.  Barron  [3]  gives  the  following  version. 
Theorem 2.3.2

Let $(\Omega, \mathcal{F})$ be a sequence space, i.e. $\Omega = \mathcal{Y}^\infty$ and $\mathcal{F} = \mathcal{B}^\infty$ where $\mathcal{Y}$ is a standard
Borel space and $\mathcal{B}$ its Borel $\sigma$-algebra. Let $M$ be a finite order, stationary, Markov
measure on $(\Omega, \mathcal{F})$ and $Y_t$ a stationary ergodic process with values in $\mathcal{Y}$ and
distribution $Q$. Assume absolute continuity of the n-th order marginal of $Q$ with respect
to the n-th order marginal of $M$ and denote the corresponding density by $q(y_0^{n-1})$.
Define the divergence rate of the true distribution $Q$ with respect to the reference
measure $M$ as:

\[ D_1(Q \,\|\, M) := \lim_{k \to \infty} E_Q[\log q(Y_k \mid Y_0^{k-1})] \]

Then $D_1(Q \,\|\, M)$ is well defined and moreover:

\[ \frac{1}{n} \log q(Y_0^{n-1}) \to D_1(Q \,\|\, M) \quad \text{a.e. and in } L^1 \]

□ 


In this form the theorem becomes very useful in statistics (see [3] for some
applications).

The requirement that $M$ be Markov for Theorem 2.3.2 to hold seems to be almost
necessary (see Kieffer [14]). Our result generalizes Theorem 2.3.2 to reference
measures $M$ of the HMC type but it applies only to finitely valued processes.

Theorem  2.3.3 

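Let $Y_t$ be a process with values in the finite set $\mathcal{Y}$. Assume $Y_t$ to be stationary ergodic
under the probability distribution $Q$ and a HMC under the alternative distribution
$P \in \Theta_q^\delta$ for some fixed $q$ and $\delta$. Let $q(Y_0^{n-1}) := Q(Y_0^{n-1}) / P(Y_0^{n-1})$ and define:

\[ D_1(Q \,\|\, P) := \lim_{k \to \infty} E_Q[\log q(Y_k \mid Y_0^{k-1})] \]

Then $D_1$ is well defined and moreover:

\[ \lim_{n \to \infty} \frac{1}{n} \log \frac{Q(Y_0^{n-1})}{P(Y_0^{n-1})} = D_1(Q \,\|\, P) \quad \text{a.e. } Q \tag{2.9} \]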

□ 
The proof will be given through a series of lemmas. First we will prove, with the
help of Lemma 2.3.1, the existence and finiteness of $D_1$. Lemmas 2.3.3 and 2.3.4
will allow us to find a more explicit expression for $D_1$. Using this new expression
it will be easy to complete the proof, i.e. show (2.9), applying the ergodic theorem
and the basic inequality for HMC's (A.3).


Lemma  2.3.1 


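\[ \delta < P(Y_k \mid Y_0^{k-1}) < 1 - \delta \qquad \forall k,\ \forall Y_0^k,\ \forall P \in \Theta_q^\delta \]


Proof:

\[ P(Y_k \mid Y_0^{k-1}) = \sum_i P(Y_k, X_k = i \mid Y_0^{k-1}) = \sum_i P(Y_k \mid X_k = i)\, P(X_k = i \mid Y_0^{k-1}) \]

Therefore $\forall\, Y_0^{k-1}$

\[ \min_i P(Y_k \mid X_k = i) \le P(Y_k \mid Y_0^{k-1}) \le \max_i P(Y_k \mid X_k = i) \]

and under the assumption that $P \in \Theta_q^\delta$ we have:

\[ \forall\, P, k, Y_0^k:\quad \delta < P(Y_k \mid Y_0^{k-1}) < 1 - \delta \]

□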
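
The convex-combination argument in the proof of Lemma 2.3.1 can be illustrated with a small forward filter. The sketch below is only a numerical check under stated assumptions: `B[i][y]` stands for $P(Y_k = y \mid X_k = i)$, the filter is started from the invariant vector, and all names and parameter values are hypothetical.

```python
import random


def predictive_probabilities(pi, A, B, ys):
    """Return [P(Y_k = ys[k] | Y_0^{k-1})] for k = 0..len(ys)-1 via the forward filter."""
    q = len(pi)
    pred = list(pi)  # P(X_k = i | Y_0^{k-1}), starting from the initial law
    out = []
    for y in ys:
        joint = [pred[i] * B[i][y] for i in range(q)]
        p_y = sum(joint)  # convex combination of the b_{iy}
        out.append(p_y)
        filt = [v / p_y for v in joint]  # P(X_k = i | Y_0^k)
        pred = [sum(filt[i] * A[i][j] for i in range(q)) for j in range(q)]
    return out


delta = 0.05
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]  # every entry exceeds delta
pi = [4 / 7, 3 / 7]           # invariant vector of A

rng = random.Random(0)
ys = [rng.randrange(2) for _ in range(500)]  # the bound holds for any observation string
probs = predictive_probabilities(pi, A, B, ys)
print(min(probs), max(probs))
```

Because the bound holds for every observation string, the test string here is drawn uniformly at random rather than from the model itself.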


Lemma 2.3.2  $D_1$ exists and is finite


Proof: Define $R(Y_0^k) := Q(Y_0^{k-1})\, P(Y_k \mid Y_0^{k-1})$. It is easily verified that $R$ is a
probability measure on $\sigma(Y_0^k)$ and that $q(Y_k \mid Y_0^{k-1}) = Q(Y_0^k) / R(Y_0^k)$. Being a likelihood
ratio, $\{q(Y_k \mid Y_0^{k-1}),\ \sigma(Y_0^k)\}$ is an $R$-martingale and from the convexity of the function
$x \log x$ it follows that $\{q(Y_k \mid Y_0^{k-1}) \log q(Y_k \mid Y_0^{k-1}),\ \sigma(Y_0^k)\}$ is an $R$-submartingale.
All of this is trivially verified and it implies that $\{\log q(Y_k \mid Y_0^{k-1}),\ \sigma(Y_0^k)\}$ is a
$Q$-submartingale. This can be seen as follows:


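\[
\begin{aligned}
E_R[\,q(Y_k \mid Y_0^{k-1}) \log q(Y_k \mid Y_0^{k-1}) \mid Y_0^{k-1}]
&= \sum_{y_k} R(y_k \mid Y_0^{k-1})\, q(y_k \mid Y_0^{k-1}) \log q(y_k \mid Y_0^{k-1}) \\
&= \sum_{y_k} \frac{Q(Y_0^{k-1})\, P(y_k \mid Y_0^{k-1})}{Q(Y_0^{k-2})\, P(Y_{k-1} \mid Y_0^{k-2})} \cdot \frac{Q(y_k \mid Y_0^{k-1})}{P(y_k \mid Y_0^{k-1})} \log q(y_k \mid Y_0^{k-1}) \\
&= q(Y_{k-1} \mid Y_0^{k-2})\, E_Q[\log q(Y_k \mid Y_0^{k-1}) \mid Y_0^{k-1}]
\end{aligned}
\]

Therefore, since $q \log q$ is an $R$-submartingale:

\[
\begin{aligned}
E_Q[\log q(Y_k \mid Y_0^{k-1}) \mid Y_0^{k-1}]
&= \frac{1}{q(Y_{k-1} \mid Y_0^{k-2})}\, E_R[\,q(Y_k \mid Y_0^{k-1}) \log q(Y_k \mid Y_0^{k-1}) \mid Y_0^{k-1}] \\
&\ge \frac{1}{q(Y_{k-1} \mid Y_0^{k-2})}\, q(Y_{k-1} \mid Y_0^{k-2}) \log q(Y_{k-1} \mid Y_0^{k-2}) \\
&= \log q(Y_{k-1} \mid Y_0^{k-2})
\end{aligned}
\]

which proves the assertion.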

From the $Q$-submartingale property of $\log q(Y_k \mid Y_0^{k-1})$ we immediately conclude that
$D_1$ exists, since the expectations $E_Q[\log q(Y_k \mid Y_0^{k-1})]$ increase in $k$ and therefore
have a limit (possibly $+\infty$). The finiteness of $D_1$ is obtained as follows. For a fixed
$k$ we have:

\[ E_Q[\log q(Y_k \mid Y_0^{k-1})] = E_Q[\log Q(Y_k \mid Y_0^{k-1})] - E_Q[\log P(Y_k \mid Y_0^{k-1})] \]

The first term on the RHS equals $E_Q[\log Q(Y_0 \mid Y_{-k}^{-1})]$ (by stationarity) and this
converges to minus the entropy of $Y_t$, which is bounded by $\log |\mathcal{Y}|$. The second
term on the RHS is bounded because of Lemma 2.3.1. This concludes the proof of
the lemma.


Remark: By stationarity we have that:

\[ D_1(Q \,\|\, P) = \lim_{k \to \infty} E_Q[\log q(Y_0 \mid Y_{-k}^{-1})] \tag{2.10} \]

The following Lemmas 2.3.3 and 2.3.4 will allow us to find a more explicit
expression for $D_1$.

Lemma  2.3.3 

$\log q(Y_0 \mid Y_{-k}^{-1}) \to Z$ a.e. $Q$ and in $L^1$, for some r.v. $Z$ in $L^1$.

Proof: It follows from Barron [3] (equation 2.7 on page 1295) that:

\[ E_Q\Big[\sup_k |\log q(Y_0 \mid Y_{-k}^{-1})|\Big] \le D_1(Q \,\|\, P) + (e + 3) \]

Since in our case $D_1 < \infty$ this proves that the random variables $\log q(Y_0 \mid Y_{-k}^{-1})$
are dominated by a $Q$-integrable r.v. (and are therefore uniformly integrable). We
also have:

\[ \sup_k E_Q[\log^+ q(Y_0 \mid Y_{-k}^{-1})] \le \sup_k E_Q[\,|\log q(Y_0 \mid Y_{-k}^{-1})|\,] \le E_Q\Big[\sup_k |\log q(Y_0 \mid Y_{-k}^{-1})|\Big] < \infty \]

Since $\{\log q(Y_0 \mid Y_{-k}^{-1}),\ \sigma(Y_{-k}^{0})\}$ is a $Q$-submartingale it follows from standard theory that
$\log q(Y_0 \mid Y_{-k}^{-1}) \to Z$ a.e. $Q$. The $Q$-integrability of $Z$ and the convergence in $L^1$
follow from uniform integrability.


Lemma 2.3.4  $P(Y_0 \mid Y_{-k}^{-1})$ converges to a limit $P(Y_0 \mid Y_{-\infty}^{-1})$ uniformly in $P \in \Theta_q^\delta$
and $Y(\omega)$.

Proof: Let $f_k := P(Y_0 \mid Y_{-k}^{-1})$; then $f_k$ converges iff $\sum_{k=1}^{\infty} (f_{k+1} - f_k)$ converges.
From the application of inequality A.0.3 it is easily seen that the last series
converges absolutely and therefore it converges. The convergence is uniform in $P$ and
$Y(\omega)$ because $\rho$ in A.0.3 is independent of them.


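What makes Lemma 2.3.4 remarkable is the fact that the convergence of
$P(Y_0 \mid Y_{-k}^{-1})$ is uniform in $Y(\omega)$. A similar, but weaker, result holds for the stationary
measure $Q$. In particular $Q(y_0 \mid Y_{-k}^{-1})$ is a bounded martingale and
therefore it converges a.e. $Q$ to the limit $Q(y_0 \mid Y_{-\infty}^{-1})$. Since $\mathcal{Y}$ is finite it also
follows that $Q(Y_0 \mid Y_{-k}^{-1}) \to Q(Y_0 \mid Y_{-\infty}^{-1})$ a.e. $Q$.

Since $Q(Y_0 \mid Y_{-k}^{-1}) \to Q(Y_0 \mid Y_{-\infty}^{-1})$ a.e. $Q$ and $P(Y_0 \mid Y_{-k}^{-1}) \to P(Y_0 \mid Y_{-\infty}^{-1})$ for
all $Y(\omega)$ we have:

\[ \log q(Y_0 \mid Y_{-k}^{-1}) \to \log \frac{Q(Y_0 \mid Y_{-\infty}^{-1})}{P(Y_0 \mid Y_{-\infty}^{-1})} \quad \text{a.e. } Q. \]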
From Lemma 2.3.3 and the uniqueness of the limit we can identify the r.v. $Z$ as:

\[ Z = \log \frac{Q(Y_0 \mid Y_{-\infty}^{-1})}{P(Y_0 \mid Y_{-\infty}^{-1})} \quad \text{a.e. } Q \]

From the fact that $\log q(Y_0 \mid Y_{-k}^{-1}) \to Z$ in $L^1$ and (2.10) we finally conclude that:

\[ D_1(Q \,\|\, P) = E_Q\!\left[ \log \frac{Q(Y_0 \mid Y_{-\infty}^{-1})}{P(Y_0 \mid Y_{-\infty}^{-1})} \right] \]

To complete the proof of Theorem 2.3.3 it is now sufficient to show (2.9), i.e.,
using the last expression for $D_1$, that:

\[ \lim_{n \to \infty} \frac{1}{n} \log \frac{Q(Y_0^{n-1})}{P(Y_0^{n-1})} = E_Q\!\left[ \log \frac{Q(Y_0 \mid Y_{-\infty}^{-1})}{P(Y_0 \mid Y_{-\infty}^{-1})} \right] \tag{2.11} \]

This can be obtained from the ergodic theorem with the help of A.0.3. Define:

\[ g_k(Y) := \log q(Y_0 \mid Y_{-k}^{-1}), \qquad g(Y) := \log q(Y_0 \mid Y_{-\infty}^{-1}) \]

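and denote by $T$ the shift operator. Equation (2.11) now becomes:

\[ \lim_{n \to \infty} \frac{1}{n} \sum_{k=0}^{n-1} g_k(T^k Y) = E_Q[g(Y)] \quad \text{a.e. } Q \]

while the ergodic theorem gives:

\[ \lim_{n \to \infty} \frac{1}{n} \sum_{k=0}^{n-1} g(T^k Y) = E_Q[g(Y)] \]

To finish observe that:

\[
\begin{aligned}
\Big| \frac{1}{n} \sum_{k=0}^{n-1} g_k(T^k Y) - \frac{1}{n} \sum_{k=0}^{n-1} g(T^k Y) \Big|
\le{} & \Big| \frac{1}{n} \sum_{k=0}^{n-1} \log Q(Y_k \mid Y_0^{k-1}) - \frac{1}{n} \sum_{k=0}^{n-1} \log Q(Y_k \mid Y_{-\infty}^{k-1}) \Big| \\
& + \Big| \frac{1}{n} \sum_{k=0}^{n-1} \log P(Y_k \mid Y_0^{k-1}) - \frac{1}{n} \sum_{k=0}^{n-1} \log P(Y_k \mid Y_{-\infty}^{k-1}) \Big|
\end{aligned}
\]

The first term on the RHS equals:

\[ \Big| \frac{1}{n} \log Q(Y_0^{n-1}) - \frac{1}{n} \log Q(Y_0^{n-1} \mid Y_{-\infty}^{-1}) \Big| \to 0 \quad (\text{a.e. } Q) \]

since both $\frac{1}{n} \log Q(Y_0^{n-1})$ and $\frac{1}{n} \log Q(Y_0^{n-1} \mid Y_{-\infty}^{-1})$ converge (a.e. $Q$) to minus the
entropy.

For the second term on the RHS we have:

\[
\Big| \frac{1}{n} \sum_{k=0}^{n-1} \log P(Y_k \mid Y_0^{k-1}) - \frac{1}{n} \sum_{k=0}^{n-1} \log P(Y_k \mid Y_{-\infty}^{k-1}) \Big|
\le \frac{1}{n} \sum_{k=0}^{n-1} \big| \log P(Y_k \mid Y_0^{k-1}) - \log P(Y_k \mid Y_{-\infty}^{k-1}) \big| \to 0 \quad \text{everywhere.}
\]

The convergence to zero follows from the application of A.0.3 and Lemma 2.3.1.
This completes the proof of Theorem 2.3.3.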

Aside: 

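We prove here (2.3) and (2.4) from Section 2.2. In (2.3) we defined the
divergence rate as: $D(Q \,\|\, P) := \lim_{n \to \infty} \frac{1}{n} E_Q\big[\log \frac{Q(Y_0^{n-1})}{P(Y_0^{n-1})}\big]$.

Clearly $D(Q \,\|\, P) = \lim_{n \to \infty} \frac{1}{n} \sum_{k=0}^{n-1} E_Q[\log q(Y_k \mid Y_0^{k-1})]$. From the definition of
$D_1(Q \,\|\, P)$ given in Theorem 2.3.3 and the Cesàro convergence theorem it follows
that if $D_1$ exists then $D$ exists and $D = D_1$. From Lemma 2.3.2 we therefore
conclude that $D$ is indeed well defined. From what we have just proved we have:

\[ D(Q \,\|\, P_\theta) = E_Q\!\left[ \log \frac{Q(Y_0 \mid Y_{-\infty}^{-1})}{P_\theta(Y_0 \mid Y_{-\infty}^{-1})} \right] \]

Define:

\[ H_Q(\theta) := E_Q[\log P_\theta(Y_0 \mid Y_{-\infty}^{-1})] \]

From Lemma 2.3.4 we know that $P_\theta(Y_0 \mid Y_{-\infty}^{-1})$ is the uniform limit of the
sequence of continuous functions $P_\theta(Y_0 \mid Y_{-k}^{-1})$ and is therefore continuous in $\theta \in \Theta_q^\delta$.
Continuity of $H_Q(\theta)$ follows.

Defining $H_Q := E_Q[\log Q(Y_0 \mid Y_{-\infty}^{-1})]$ we get $D(Q \,\|\, P_\theta) = H_Q - H_Q(\theta)$, which is
the decomposition claimed in (2.4).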

□ 

2.4  Uniform convergence of $h_n(\theta, Y)$

We prove here (2.6) of Section 2.2, i.e. that:

\[ h_n(\theta, Y) \to H_Q(\theta) \quad \text{a.e. } Q, \text{ uniformly in } \theta \in \Phi \]

In (2.1) we defined $h_n(\theta, Y) := \frac{1}{n} \log P_\theta(Y_1^n)$ and from the results of Section 2.3 we
already have that:

\[ \frac{1}{n} \log P_\theta(Y_1^n) \to H_Q(\theta) \quad \text{a.e. } Q, \text{ pointwise in } \theta \in \Phi \]

Since $\Phi$ is compact, to conclude that the convergence is uniform it is enough to
show the equicontinuity of the functions $h_n(\theta, Y)$.

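Lemma 2.4.1  $h_n(\theta, Y)$ is an equicontinuous sequence

Proof: We will show that $\forall \varepsilon > 0$ there exists $\delta(\varepsilon) > 0$ such that:

\[ \forall n \quad |h_n(\theta, Y) - h_n(\theta', Y)| < \varepsilon \quad \text{if } |\theta - \theta'| < \delta(\varepsilon) \]

This can easily be seen working directly with the Markov process $S_t = (X_t, Y_t)$,
which has state space $\mathcal{S} = \mathcal{X} \times \mathcal{Y}$. If $\theta = \{q, A, B\}$, $s = (i, y)$, $\bar s = (j, \bar y)$ then the
transition matrix $T$ of $S_t$ has elements $t_{s \bar s} := P_\theta(S_{t+1} = \bar s \mid S_t = s) = a_{ij} b_{j \bar y}$.
If $\theta \in \Theta_q^\delta$ the matrix $T$ is strictly positive and admits a unique invariant vector $r$. We
have that:

\[ P_\theta(S_1^n) = r_{S_1} \prod_{t=1}^{n-1} t_{S_t S_{t+1}} = r_{S_1} \prod_{(s, \bar s)} t_{s \bar s}^{\,n_{s \bar s}} \]

where

\[ n_{s \bar s} := \sum_{t=1}^{n-1} 1(S_t = s,\ S_{t+1} = \bar s) \]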

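Let now $\theta' := \{q, A', B'\}$ be another point in $\Theta_q^\delta$. We have:

\[ |h_n(\theta, S) - h_n(\theta', S)| \le \Big| \frac{1}{n} \log r_{S_1} - \frac{1}{n} \log r'_{S_1} \Big| + \Big| \frac{1}{n} \sum_{s, \bar s} n_{s \bar s} \log \frac{t_{s \bar s}}{t'_{s \bar s}} \Big| \]

Since $n_{s \bar s} / n \le 1$:

\[ \le \frac{1}{n} |\log r_{S_1} - \log r'_{S_1}| + \sum_{s, \bar s} |\log t_{s \bar s} - \log t'_{s \bar s}| \]

Define

\[ \Delta(\theta, \theta') := \sum_{s, \bar s} |\log t_{s \bar s} - \log t'_{s \bar s}| \]

From the expression for $t_{s \bar s}$ given above we have that:

\[ \Delta(\theta, \theta') \le c_0 \Big[ \sum_{i, j} |\log a_{ij} - \log a'_{ij}| + \sum_{j, y} |\log b_{jy} - \log b'_{jy}| \Big] \]

where the constant $c_0$ depends only on $q$ and $|\mathcal{Y}|$ (it accounts for the number of pairs
$(s, \bar s)$ sharing the same $a_{ij}$ or $b_{jy}$). Since all parameters are $> \delta$ this shows that for some $C > 0$

\[ \Delta(\theta, \theta') \le C\, \|\theta - \theta'\|_1 \]

Since $r$ is the eigenvector of $T$ associated with an eigenvalue of geometric multiplicity 1,
its components are continuous functions of the components of $T$ and therefore:

\[ \Big| \frac{1}{n} \log r_{S_1} - \frac{1}{n} \log r'_{S_1} \Big| \le \frac{C}{n}\, \|\theta - \theta'\|_1 \]

The final estimate is:

\[ |h_n(\theta, S) - h_n(\theta', S)| \le 2C\, \|\theta - \theta'\|_1 \]

Therefore $\forall \varepsilon > 0$ there exist $N(\varepsilon)$ and $\delta(\varepsilon)$ such that for $n \ge N(\varepsilon)$

\[ |h_n(\theta, S) - h_n(\theta', S)| < \varepsilon \quad \text{if } \|\theta - \theta'\| < \delta(\varepsilon) \tag{2.12} \]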

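To go back to the $Y$ process observe that (2.12) can be written as:

\[ \Big| \frac{1}{n} \log \frac{P_{\theta'}(S_1^n)}{P_\theta(S_1^n)} \Big| < \varepsilon. \]

Therefore from

\[ P_{\theta'}(S_1^n) < \exp\{n\varepsilon\}\, P_\theta(S_1^n) \]

we get:

\[
\begin{aligned}
P_{\theta'}(Y_1^n) &= \sum_{X_1^n} P_{\theta'}(X_1^n, Y_1^n) \\
&= \sum_{X_1^n} P_{\theta'}(S_1^n) < \exp\{n\varepsilon\} \sum_{X_1^n} P_\theta(X_1^n, Y_1^n) \\
&= \exp\{n\varepsilon\}\, P_\theta(Y_1^n)
\end{aligned}
\]

and similarly exchanging the roles of $\theta$ and $\theta'$.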


Remark:  The  idea  of  working  directly  with  the  process  St  was  suggested  by 
Chuangchun  Liu. 
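
The factorization of $P_\theta(S_1^n)$ through the transition counts $n_{s \bar s}$ used in the proof of Lemma 2.4.1 can be sketched in a few lines of Python. The sketch takes $r_s = \pi_i b_{iy}$ for $s = (i, y)$ as the invariant vector of $T$; this closed form is not stated in the text but follows from $\pi A = \pi$ and $t_{s \bar s} = a_{ij} b_{j \bar y}$. All function names and parameter values are hypothetical.

```python
import math
import random
from collections import Counter


def log_prob_direct(pi, A, B, path):
    """log P(S_1^n) as r_{S_1} times the product of one-step transitions of S_t = (X_t, Y_t)."""
    i, y = path[0]
    total = math.log(pi[i] * B[i][y])  # r_s = pi_i * b_{iy} (derived, see lead-in)
    for (i, _), (j, y_next) in zip(path, path[1:]):
        total += math.log(A[i][j] * B[j][y_next])
    return total


def log_prob_counts(pi, A, B, path):
    """Same quantity, grouped by the transition counts n_{s s'}."""
    i, y = path[0]
    total = math.log(pi[i] * B[i][y])
    counts = Counter(zip(path, path[1:]))
    for ((i, _), (j, y_next)), n in counts.items():
        total += n * math.log(A[i][j] * B[j][y_next])
    return total


def simulate(pi, A, B, n, rng):
    """Draw S_1^n = ((X_1, Y_1), ..., (X_n, Y_n)) from the joint chain."""
    def draw(weights):
        return rng.choices(range(len(weights)), weights=weights)[0]
    x = draw(pi)
    path = [(x, draw(B[x]))]
    for _ in range(n - 1):
        x = draw(A[x])
        path.append((x, draw(B[x])))
    return path


A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]
pi = [4 / 7, 3 / 7]
path = simulate(pi, A, B, 300, random.Random(1))
print(log_prob_direct(pi, A, B, path), log_prob_counts(pi, A, B, path))
```

Grouping by counts is what makes the Lipschitz estimate possible: the dependence on $\theta$ enters only through finitely many terms $\log t_{s \bar s}$, each weighted by $n_{s \bar s} / n \le 1$.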
Chapter 3

Estimation  of  the  Order  of  a 
Markov  Chain 


As originally planned this should have been a short review chapter on the
applications of the Law of the Iterated Logarithm (LIL) in estimation problems. For
the reasons explained in the introduction to Section 3.4 below, we decided to show
the LIL in action on a real problem: the estimation of the order of a finite order
Markov chain. We will later make use of these results in the context of HMC's,
where finite order Markov chains will be useful to approximate the distribution of
the HMC.

In  Section  3.1  we  briefly  present  a  version  of  the  LIL  for  square  integrable 
martingales  following  Neveu  [18].  Section  3.2  shows  an  application  to  Markov 
chains  and  gives  a  result  on  the  estimation  of  the  stationary  vector.  In  Section  3.3 
we  introduce  the  notion  of  finite  order  Markov  chain,  give  sufficient  conditions  for 
its  ergodicity  and  study  the  asymptotics  of  the  Maximum  Likelihood  Estimator 
(MLE)  of  the  transition  matrix.  The  delicate  rate  estimate  given  in  Theorem  3.3.2 
is  the  key  to  Section  3.4. 

We start Section 3.4 by clarifying the notion of order as "minimal memory" of
the finite order Markov chain, and then use our asymptotic results to construct an
estimator of the order. The basic idea of Section 3.4 occurred to us while reading
the beautiful booklet by Azencott and Dacunha-Castelle [1] on the estimation of the
order of ARMA processes. Another useful source of ideas, especially for Section
3.3, has been Nishii [19], where the iid case is treated.
3.1  The LIL for Square Integrable Martingales

We sketch here the version of the LIL that is most convenient for our
purposes. Everything is standard and can be found in full detail in e.g. Neveu
[18]. Readers familiar with the result announced in the title are advised to only
browse through this section.

The classical version of the Strong Law of Large Numbers (SLLN) asserts that
if $X_i$ is a sequence of iid random variables with $E|X_1| < \infty$ and $EX_1 = \mu$
then $\frac{1}{n} S_n \to \mu$ a.e., where $S_n := \sum_{i=1}^{n} X_i$. Under the additional hypothesis that
$EX_1^2 < \infty$ the variance $\sigma^2 := E(X_1 - \mu)^2$ is finite and the rate of convergence of
$\frac{1}{n} S_n$ can be evaluated as follows:

\[ \Big| \frac{1}{n} S_n - \mu \Big| = Z_n \sqrt{\frac{2 \sigma^2 \log\log n}{n}} \]

where $\limsup_n Z_n = 1$ a.s.

This  is  the  classical  Kolmogorov’s  LIL.  Many  versions  of  the  LIL  have  been 
developed  to  extend  Kolmogorov’s  result  to  the  non  iid  case.  We  will  be  content 
with  the  version  for  square  integrable  martingales  as  given  in  Neveu  [18]  pg-  14i- 
156.  The  statement  of  the  theorem  is  followed  by  a  brief  comment  on  its  conditions 

and  implications. 

Theorem  3.1.1 

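Let $(X_n, n \in N)$ be a square integrable martingale such that
$\sup_n |X_{n+1} - X_n| \le c$ a.e. for some finite constant $c$.

If $A_n$ denotes the increasing process associated to the submartingale $(X_n^2, n \in N)$
then:

\[ \limsup_n \frac{X_n}{\sqrt{2 A_n \log\log A_n}} = 1 \quad \text{a.s. on } [A_\infty = \infty] \]

\[ \liminf_n \frac{X_n}{\sqrt{2 A_n \log\log A_n}} = -1 \quad \text{a.s. on } [A_\infty = \infty] \]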

□ 
Comments 3.1.2

a)  The sequence of r.v.'s $X_n$ is a square integrable martingale if:

i)  $EX_n^2 < \infty$ $\forall n \in N$; and

ii)  $E[X_n \mid F_{n-1}] = X_{n-1}$ where $F_{n-1}$ is the $\sigma$-field generated by $X_1^{n-1}$.

b)  The increasing process $A_n$ (associated to the Doob decomposition of $X_n^2$)
is given by:

\[ A_{n+1} - A_n = E[X_{n+1}^2 \mid F_n] - X_n^2 \]

But for every r.v. $Y$ and sub-sigma-field $B$:

\[ E[(Y - E[Y \mid B])^2 \mid B] = E[Y^2 \mid B] - (E[Y \mid B])^2 \]

and therefore we obtain:

\[ A_{n+1} - A_n = E[(X_{n+1} - X_n)^2 \mid F_n] \]

or:

\[ A_n = \sum_{k=1}^{n} E[(X_k - X_{k-1})^2 \mid F_{k-1}] \]

with the convention that $X_0 = 0$.

c)  A weaker form of the result is:

\[ \limsup_n \frac{|X_n|}{\sqrt{2 A_n \log\log A_n}} = 1 \quad \text{a.s.} \tag{3.1} \]

This can be immediately inferred from Theorem 3.1.1 and will be used very
often for our results.


d)  Definition ($O_{a.s.}$)

Let $Z_n$ be a sequence of r.v.'s and $\alpha_n > 0$ a sequence of positive reals. We say
that $Z_n = O_{a.s.}(\alpha_n)$ if there exists a positive random variable $C$, almost surely finite,
such that: $|Z_n| \le C \alpha_n$ $\forall n$.

e)  We will often be able to substitute $A_n$ in (3.1) with $n$ and get

\[ \limsup_n \frac{|X_n|}{\sqrt{n \log\log n}} = \rho \quad \text{a.s.} \]

for some constant $\rho$, which easily gives $X_n = O_{a.s.}(\sqrt{n \log\log n})$.
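
A quick simulation can make the normalization in (3.1) concrete. The sketch below uses the simple symmetric random walk as a hypothetical example of a square integrable martingale with bounded increments; its conditional increment variances all equal 1, so $A_n = n$ by comment b). It assumes the normalization $\sqrt{2 A_n \log\log A_n}$ and is an illustration, not a proof of the almost sure limit.

```python
import math
import random


def lil_ratios(n_steps, seed=0, start=100):
    """Return |X_n| / sqrt(2 n log log n) for n = start..n_steps along one random-walk path."""
    rng = random.Random(seed)
    x = 0
    ratios = []
    for n in range(1, n_steps + 1):
        x += rng.choice((-1, 1))  # martingale increment, conditional variance 1, so A_n = n
        if n >= start:
            ratios.append(abs(x) / math.sqrt(2 * n * math.log(math.log(n))))
    return ratios


ratios = lil_ratios(20000)
print(max(ratios))  # stays of order one, consistent with X_n = O(sqrt(n log log n))
```

The ratio fluctuates but remains bounded along the path, which is the practical content of $X_n = O_{a.s.}(\sqrt{n \log\log n})$.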
3.2  Application to Markov Chains


As  a  first  example  of  application  we  will  use  the  LIL  to  find  the  rate  of  convergence 
of  the  maximum  likelihood  estimators  of  the  parameters  of  a  Markov  chain.  The 
derivation  of  the  results  is  only  sketched  because  it  will  be  given  in  full  detail,  for 
a  more  general  case,  in  Section  3.3.  Trivial  as  it  might  seem,  Theorem  3.2.1  is  a 
little  puzzling  because  it  cannot  be  derived  directly. 

Let $(X_t, t \in N)$ be a finite Markov chain with state space $\mathcal{X} = \{1, 2, \cdots, q\}$ and
assume that the transition matrix $A^*$ of $X_t$ is strictly positive. This is equivalent to
the existence of a $\delta > 0$ such that $a^*_{ij} > \delta$ ($\forall i, j$). To $A^*$ there corresponds a unique
invariant vector $\pi^*$ whose components $\pi^*_j > \delta$ ($\forall j$). We want to estimate $A^*$ from
the trajectory $X_1^n$. It is convenient to take as parameters $\theta := \{a_{ij}\}$.

The maximum likelihood estimator of $\theta$ based on $n$ observations is
given by:

\[ \hat a_{ij}(n) = \frac{N(i, j, n)}{N(i, n)} \]

where

\[ N(i, j, n) := \sum_{t=1}^{n-1} 1(X_t = i,\ X_{t+1} = j) \]

and

\[ N(i, n) := \sum_{t=1}^{n-1} 1(X_t = i) \]

By the SLLN:

\[ \hat a_{ij}(n) \to a^*_{ij} \quad \text{a.s.} \]

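What is the rate of convergence? Observe that:

\[ \hat a_{ij}(n) - a^*_{ij} = \frac{N(i, j, n) - a^*_{ij} N(i, n)}{N(i, n)} \tag{3.2} \]

Define:

\[ M_{ij}(n) := N(i, j, n) - a^*_{ij} N(i, n) \]

It is easily seen that $M_{ij}(n)$ is a square integrable martingale satisfying the
conditions of Theorem 3.1.1. The corresponding $A_n$ process is given by:

\[ A_n = N(i, n)\, a^*_{ij} (1 - a^*_{ij}) \]

By the SLLN $N(i, n)/n \to \pi^*_i$ a.s. and therefore:

\[ \frac{A_n}{n} \to \pi^*_i\, a^*_{ij} (1 - a^*_{ij}) \quad \text{a.s.} \]

Since by hypothesis the RHS is strictly positive we have:

\[ A_n \to \infty \quad \text{a.s.} \]

We also have that (defining $\lambda_{ij} := \pi^*_i a^*_{ij} (1 - a^*_{ij})$):

\[ \lim_n \frac{A_n \log\log A_n}{n \log\log n} = \lambda_{ij} \quad \text{a.s.} \]

Substitution into (3.1) gives:

\[ \limsup_n \frac{|M_{ij}(n)|}{\sqrt{n \log\log n}} = \sqrt{2 \lambda_{ij}} \quad \text{a.s.} \]

Therefore $M_{ij}(n) = O_{a.s.}(\sqrt{n \log\log n})$.

Dividing numerator and denominator in (3.2) by $n$ we get:

\[ \hat a_{ij}(n) - a^*_{ij} = O_{a.s.}\!\left( \sqrt{\frac{\log\log n}{n}} \right) \]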


Theorem  3.2.1 


\[ \frac{N(i, n)}{n} - \pi^*_i = O_{a.s.}\!\left( \sqrt{\frac{\log\log n}{n}} \right) \tag{3.3} \]


Proof: First we observe that $N(i, n) - n \pi^*_i$ is not a martingale anymore and
therefore the previous method for the determination of the rate cannot be applied.
The idea we use is the following. Let $N^c(i, j, n)$ and $N^c(i, n)$ denote the counts
taken with the circular convention (i.e. considering $X_1$ as the successor of $X_n$).
From the asymptotic point of view nothing changes since $N^c(i, j, n) = N(i, j, n) \pm 1$
and $N^c(i, n) = N(i, n) \pm 1$, but now it is easily verified that defining the vector $\pi_n$
and the matrix $A_n$ via:

\[ (\pi_n)_i := \frac{N^c(i, n)}{n}, \qquad (A_n)_{ij} := \frac{N^c(i, j, n)}{N^c(i, n)} \]

we have:

\[ \pi_n A_n = \pi_n \]

From the previous analysis we know that, with $\alpha_n := \sqrt{\log\log n / n}$:

\[ \pi_n = \pi_n (A^* + O_{a.s.}(\alpha_n)) \quad (\text{componentwise}) \]

Subtract $\pi^* = \pi^* A^*$ from both sides. Then

\[ \pi_n - \pi^* = (\pi_n - \pi^*) A^* + \pi_n\, O_{a.s.}(\alpha_n) \]

which we can rewrite as:

\[ (\pi_n - \pi^*)(I - A^*) = O_{a.s.}(\alpha_n) \]

But $A^* > 0$ by hypothesis and therefore rank $(I - A^*) = q - 1$ (because the
eigenvector $\pi^*$ of $A^*$ has geometric multiplicity 1). Let $\Delta$ be a minor of order
$q - 1$ and rank $q - 1$ of $I - A^*$ and denote by $(\pi_n - \pi^*)_\Delta$ the corresponding $(q - 1)$-
subvector of $\pi_n - \pi^*$. Then $(\pi_n - \pi^*)_\Delta\, \Delta = O_{a.s.}(\alpha_n)$ and since $\Delta$ is invertible

\[ (\pi_n - \pi^*)_\Delta = O_{a.s.}(\alpha_n) \]

Let $j$ be the index of the component of $\pi_n - \pi^*$ not contained in $(\pi_n - \pi^*)_\Delta$.
Clearly:

\[ (\pi_n)_j = 1 - \sum_{i \neq j} (\pi_n)_i = 1 - \sum_{i \neq j} \big( \pi^*_i + O_{a.s.}(\alpha_n) \big) \]

and the conclusion is that:

\[ \pi_n - \pi^* = O_{a.s.}(\alpha_n) \quad (\text{componentwise}) \]

This is in perfect agreement with (3.3).
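
The key identity $\pi_n A_n = \pi_n$ under the circular convention can be verified directly on a simulated trajectory. The following Python sketch is illustrative: the transition matrix and function names are hypothetical, and it assumes every state is visited at least once so that no division by zero occurs.

```python
import random


def simulate_chain(A, n, rng):
    """Draw X_1..X_n from a Markov chain with transition matrix A and a uniform start."""
    q = len(A)
    x = rng.randrange(q)
    path = [x]
    for _ in range(n - 1):
        x = rng.choices(range(q), weights=A[x])[0]
        path.append(x)
    return path


def circular_estimates(path, q):
    """Empirical occupation vector and transition matrix, treating X_1 as the successor of X_n."""
    n = len(path)
    pair = [[0] * q for _ in range(q)]
    for t in range(n):
        pair[path[t]][path[(t + 1) % n]] += 1
    visits = [sum(row) for row in pair]  # circular N^c(i, n); assumed positive for all i
    pi_n = [v / n for v in visits]
    A_n = [[pair[i][j] / visits[i] for j in range(q)] for i in range(q)]
    return pi_n, A_n


A_star = [[0.5, 0.3, 0.2], [0.2, 0.6, 0.2], [0.3, 0.3, 0.4]]  # strictly positive
path = simulate_chain(A_star, 5000, random.Random(2))
pi_n, A_n = circular_estimates(path, 3)
lhs = [sum(pi_n[i] * A_n[i][j] for i in range(3)) for j in range(3)]
print(lhs, pi_n)  # the two vectors agree up to rounding
```

The identity is exact because, with the circular convention, each state is entered exactly as many times as it is left.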




3.3  Rates  of  convergence  of  the  MLE 

The  results  collected  here  will  be  essential  in  the  next  section  where  we  solve  the 
problem  of  the  estimation  of  the  order  of  a  Markov  chain  Xt  from  the  observations 

$\{X_1^n\}$.

Definition  3.3.1 

The stationary, finitely valued, process $(X_t, t \ge 1)$ is a finite order Markov chain if for some integer $m \ge 0$

$$P(X_t = j \mid X_1^{t-1}) = P(X_t = j \mid X_{t-m}^{t-1}) \quad \text{a.e.} \quad \forall\, t \ge m+1,\ \forall\, j \in \mathcal{X}$$

where $\mathcal{X} = \{1, 2, \dots, q\}$ is the state space of $X_t$.


This is the classical definition (see Doob [9] pg 89) and it is somewhat unsatisfactory because it does not uniquely specify what the order is. For the time being we will say that a chain has order less than or equal to $m$ if $m$ is any integer satisfying Definition 3.3.1. A precise definition of the order will be given in Section 3.4 below.

The case $m = 0$ corresponds to an iid process, and $m = 1$ to a Markov chain. The probability distribution of a finite order stationary Markov chain of order $\le m$ is completely specified by the set of transition probabilities (t.p.):


$$a(i^m, j) := P(X_t = j \mid X_{t-m}^{t-1} = i^m),$$

and by the initial probabilities (i.p.):

$$\pi(i^m) := P(X_1^m = i^m).$$

Observe that $X_t$ is stationary only if $\pi(i^m)$ is an invariant measure.

The probability of the cylinder $\{X_1^n = x_1^n\}$, for $n \ge m+1$, in terms of t.p. and i.p. is:

$$P(X_1^n = x_1^n) = P(X_1^m = x_1^m) \prod_{t=m+1}^{n} P(X_t = x_t \mid X_{t-m}^{t-1} = x_{t-m}^{t-1})$$

$$= \pi(x_1^m) \prod_{t=m+1}^{n} a(x_{t-m}^{t-1}, x_t)$$

$$= \pi(x_1^m) \prod_{i^m, j} a(i^m, j)^{N(i^m, j, n)} \tag{3.4}$$

where:

$$N(i^m, j, n) := \sum_{t=m+1}^{n} 1(X_{t-m}^{t-1} = i^m,\ X_t = j) \tag{3.5}$$

Later in this section we will need the SLLN for functionals of $X_t$. The stationarity of $X_t$ allows us to use the a.e. version of the ergodic theorem to get the required SLLN, but to keep matters as simple as possible we will assume ergodicity.

Remark: If we also assume that the observations consist of the initial segment of one trajectory then at no extra cost we may assume that there is only one ergodic class.

Conditions  for  the  ergodicity  of  Xt  can  easily  be  given  in  terms  of  the  t.p. 
a(im,j),  but  to  express  them  nicely  we  need  to  introduce  a  new  process. 

Definition  3.3.2 

Let $X_t$ be a stationary process. The m-th derived process $Y_t$ is defined as:

$$Y_t := (X_t, X_{t+1}, \dots, X_{t+m-1}), \quad t \ge 1$$

If $X_t$ is a finite order Markov chain of order $\le m$ then it is immediately seen that $Y_t$ is a Markov chain.

The transition probability matrix $T$ of $Y_t$ is of size $q^m \times q^m$ with at most $q^{m+1}$ elements different from zero since obviously:

$$t_{i^m j^m} = 0 \quad \text{unless } j_1 = i_2, \dots, j_{m-1} = i_m$$
$$t_{i^m j^m} = a(i^m, j_m) \quad \text{when } j_1 = i_2, \dots, j_{m-1} = i_m.$$

The following lemma will give a sufficient condition for the ergodicity of $X_t$ in terms of $T$.

Lemma 3.3.1

Let $X_t$ be any stationary process and $Y_t$ its m-th derived process. If $Y_t$ is ergodic then $X_t$ is ergodic.


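Proof: Ergodicity of a stationary finitely valued process $Z_t$ is equivalent to (Walters [24] pg 41): $\forall\, u \ge 1,\ v \ge 1,\ z_1^u,\ \tilde z_1^v$

$$\lim_{n \to \infty} \frac{1}{n} \sum_{k=0}^{n-1} P(Z_1^u = z_1^u,\ Z_{k+1}^{k+v} = \tilde z_1^v) = P(Z_1^u = z_1^u)\, P(Z_1^v = \tilde z_1^v) \tag{3.6}$$

Write (3.6) for $Y_t = (X_t, X_{t+1}, \dots, X_{t+m-1})$; then we have $\forall\, u' \ge 1,\ \forall\, v' \ge 1$:

$$\lim_{n \to \infty} \frac{1}{n} \sum_{k=0}^{n-1} P(X_1^{u'+m-1} = x_1^{u'+m-1},\ X_{k+1}^{k+v'+m-1} = \tilde x_1^{v'+m-1}) = P(X_1^{u'+m-1} = x_1^{u'+m-1})\, P(X_1^{v'+m-1} = \tilde x_1^{v'+m-1})$$

This shows that $X_t$ satisfies (3.6) $\forall\, u \ge m,\ v \ge m$. For the case $u < m$, $v < m$ just observe that:

$$P(X_1^u = x_1^u,\ X_{k+1}^{k+v} = \tilde x_1^v) = \sum_{x_{u+1}^m,\ \tilde x_{v+1}^m} P(X_1^m = x_1^m,\ X_{k+1}^{k+m} = \tilde x_1^m)$$

Condition (3.6) is verified for each term on the RHS and therefore is satisfied also by the LHS. For the other cases, ($u < m$, $v \ge m$) and ($u \ge m$, $v < m$), the proof is analogous.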


□ 


Remark:  The  converse  of  Lemma  3.3.1  is  false. 

It is well known that for a (finitely valued) Markov chain $Y_t$ with t.p.m. $T$ the following conditions are equivalent:

i) $Y_t$ is ergodic

ii) $T$ is irreducible

iii) there exists a unique invariant vector $t$ ($t = tT$)

iv) all elements of $t$ are strictly positive

Applying Lemma 3.3.1 we now have:
Corollary

A finite order Markov chain of order $\le m$ is ergodic if the t.p.m. $T$ of the m-th derived process $Y_t = (X_t, \dots, X_{t+m-1})$ is irreducible.

□ 


Since the initial probabilities of the $Y_t$ process are in one-to-one correspondence with the initial probabilities of the $X_t$ process, we conclude that when $T$ is irreducible there is a unique set of strictly positive initial probabilities $\{\pi(i^m),\ i^m \in \mathcal{X}^m\}$ corresponding to the t.p. $a(i^m, j)$. The $\pi(i^m)$'s can be found solving the equation $tT = t$. When $T$ is irreducible a set of parameters that completely specifies the probability distribution of the corresponding $X_t$ chain is:

$$\theta := \{a(i^m, j),\ i^m \in \mathcal{X}^m,\ j = 1, 2, \dots, q-1\}$$
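The construction of $T$ from the transition probabilities and the solution of $tT = t$ can be sketched in a few lines. This is only an illustration under stated assumptions: states are coded 0..q-1 rather than 1..q, $m \ge 1$, the function names `derived_chain_matrix` and `invariant_vector` are hypothetical, and the invariant vector is approximated by iterating the lazy matrix $(I+T)/2$ (which has the same invariant vector as $T$), not by a method taken from the text.

```python
from itertools import product


def derived_chain_matrix(a, q, m):
    """Build the q^m x q^m matrix T of the m-th derived process (m >= 1).

    `a` maps (context, j) -> a(i^m, j), where context is a tuple of m
    states in 0..q-1 and j is the next state.
    """
    contexts = list(product(range(q), repeat=m))
    index = {c: k for k, c in enumerate(contexts)}
    T = [[0.0] * len(contexts) for _ in contexts]
    for c in contexts:
        for j in range(q):
            nxt = c[1:] + (j,)  # shift the window by one step
            T[index[c]][index[nxt]] = a[(c, j)]
    return contexts, T


def invariant_vector(T, iters=5000):
    """Approximate the row vector t = tT by iterating (I + T)/2."""
    n = len(T)
    t = [1.0 / n] * n
    for _ in range(iters):
        tT = [sum(t[i] * T[i][j] for i in range(n)) for j in range(n)]
        t = [0.5 * (t[j] + tT[j]) for j in range(n)]
    return t


# Illustrative binary chain of memory 2 (values chosen arbitrarily).
p_one = {(0, 0): 0.2, (0, 1): 0.6, (1, 0): 0.5, (1, 1): 0.7}
a_example = {}
for ctx, p in p_one.items():
    a_example[(ctx, 0)] = 1.0 - p
    a_example[(ctx, 1)] = p
contexts_example, T_example = derived_chain_matrix(a_example, q=2, m=2)
pi_example = invariant_vector(T_example)
```

Each row of `T_example` has at most q nonzero entries, in line with the bound of $q^{m+1}$ nonzero elements, and `pi_example[k]` approximates $\pi(i^m)$ for the context `contexts_example[k]`.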

We now study the Maximum Likelihood Estimator (MLE) of $\theta$.


Theorem  3.3.1 

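Let $X_t \in \mathcal{X} = \{1, 2, \dots, q\}$ be a finite order Markov chain of order $\le m$ and assume that the m-th derived process $Y_t$ is ergodic.

Let $\theta^* := \{a^*(i^m, j),\ i^m \in \mathcal{X}^m,\ j \le q-1\}$ be the true parameter of $X_t$.

The MLE of $\theta^*$ is given by:

$$\hat a(i^m, j, n) = \frac{N(i^m, j, n)}{N(i^m, n)}$$

where:

$$N(i^m, j, n) := \sum_{t=m+1}^{n} 1(X_{t-m}^{t-1} = i^m,\ X_t = j), \qquad N(i^m, n) := \sum_{t=m+1}^{n} 1(X_{t-m}^{t-1} = i^m)$$

The MLE converges a.s. to $\theta^*$, with rate:

$$\hat a(i^m, j, n) = a^*(i^m, j) + O_{a.s.}\!\left(\sqrt{\frac{\log\log n}{n}}\right)$$

The exact asymptotics for $a^*(i^m, j) \ne 0$ are:

$$\limsup_{n \to \infty} \left(\frac{n}{\log\log n}\right)^{1/2} \left(\hat a(i^m, j, n) - a^*(i^m, j)\right) = \sqrt{\frac{2\, a^*(i^m, j)(1 - a^*(i^m, j))}{\pi^*(i^m)}} \quad \text{a.s.}$$

$$\liminf_{n \to \infty} \left(\frac{n}{\log\log n}\right)^{1/2} \left(\hat a(i^m, j, n) - a^*(i^m, j)\right) = -\sqrt{\frac{2\, a^*(i^m, j)(1 - a^*(i^m, j))}{\pi^*(i^m)}} \quad \text{a.s.}$$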
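As an illustration of the estimator $\hat a(i^m,j,n) = N(i^m,j,n)/N(i^m,n)$, the following minimal sketch computes the counts and the MLE from one observed trajectory. The function names `transition_counts` and `mle` are hypothetical, states may be any hashable labels, and indices are 0-based, so the sum over $t = m+1, \dots, n$ becomes a loop over positions `m..len(x)-1`.

```python
from collections import Counter


def transition_counts(x, m):
    """Return (pair, ctx): pair[(c, j)] = N(c, j, n) and ctx[c] = N(c, n),
    where c ranges over the length-m contexts seen in the trajectory x."""
    pair, ctx = Counter(), Counter()
    for t in range(m, len(x)):
        c = tuple(x[t - m:t])  # the m states preceding position t
        pair[(c, x[t])] += 1
        ctx[c] += 1
    return pair, ctx


def mle(x, m):
    """MLE of the transition probabilities for the observed (context, next) pairs."""
    pair, ctx = transition_counts(x, m)
    return {(c, j): k / ctx[c] for (c, j), k in pair.items()}


# Short binary trajectory, fitted with memory m = 1.
x_example = [0, 1, 0, 1, 1, 0, 1, 0, 0, 1]
a_hat_example = mle(x_example, 1)
```

Pairs that never occur in the trajectory are simply absent from the returned dictionary; their estimate is 0 whenever the context itself has been observed.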


Proof: The MLE is obtained by direct computation. The asymptotics follow from Theorem 3.1.1. First we deal with the trivial cases. If for some $(i^m, j)$ the t.p. $a^*(i^m, j) = 0$ then $N(i^m, j, n) = 0$ with probability 1 for every $n$ and therefore the rate condition is trivially satisfied.

We now assume $a^*(i^m, j) > 0$.

Observe that:

$$\hat a(i^m, j, n) - a^*(i^m, j) = \frac{N(i^m, j, n) - a^*(i^m, j)\, N(i^m, n)}{N(i^m, n)} \tag{3.7}$$

The numerator in (3.7) is a martingale, indexed by $n$, with respect to $\sigma\{X_1^n\}$. To see this define for given $(i^m, j)$ and $t \ge m+1$:

$$u(t) := 1(X_{t-m}^{t-1} = i^m,\ X_t = j) - E[1(X_{t-m}^{t-1} = i^m,\ X_t = j) \mid X_1^{t-1}] \tag{3.8}$$

The process $u(t)$ is centered at the conditional expectation given $\sigma\{X_1^{t-1}\}$ and is therefore automatically a martingale difference. The expectation in (3.8) can be computed explicitly:

$$E[1(X_{t-m}^{t-1} = i^m,\ X_t = j) \mid X_1^{t-1}] = a^*(i^m, j)\, 1(X_{t-m}^{t-1} = i^m)$$

And substituting in (3.8):

$$u(t) = 1(X_{t-m}^{t-1} = i^m,\ X_t = j) - a^*(i^m, j)\, 1(X_{t-m}^{t-1} = i^m)$$

Since $u(t)$ is a martingale difference the process $M(n)$ defined for $n \ge m+1$ by:

$$M(n) := \sum_{t=m+1}^{n} u(t)$$

is a martingale with respect to $\sigma\{X_1^n\}$. $M(n)$ coincides with the numerator of (3.7) thus proving the claim.
$M(n)$ is square integrable because it is bounded, moreover

$$|M(n+1) - M(n)| = |u(n+1)| < 2$$

thus verifying the technical conditions of Theorem 3.1.1.

The increasing process $A_n$ is given by (see comment 3.1.2b)

$$A_n = \sum_{t=m+1}^{n} E[u^2(t) \mid X_1^{t-1}]$$

The t-th term is:

$$E[u^2(t) \mid X_1^{t-1}] = 1(X_{t-m}^{t-1} = i^m)\, a^*(i^m, j)(1 - a^*(i^m, j))$$

and therefore:

$$A_n = a^*(i^m, j)(1 - a^*(i^m, j))\, N(i^m, n)$$

Dividing both sides by $n$:

$$\frac{A_n}{n} = a^*(i^m, j)(1 - a^*(i^m, j))\, \frac{N(i^m, n)}{n}$$

By the SLLN

$$\frac{N(i^m, n)}{n} \to \pi^*(i^m) \quad \text{a.s.}$$

Define:

$$\rho(i^m, j) := a^*(i^m, j)(1 - a^*(i^m, j))\, \pi^*(i^m)$$

Under our hypotheses $\rho(i^m, j) > 0$ and therefore:

$$\lim_{n \to \infty} A_n = +\infty \quad \text{a.s.}$$

$$\lim_{n \to \infty} \frac{A_n \log\log A_n}{n \log\log n} = \rho(i^m, j) \quad \text{a.s.}$$

Theorem 3.1.1 now gives

$$\limsup_{n \to \infty} \frac{M(n)}{\sqrt{n \log\log n}} = \sqrt{2 \rho(i^m, j)} \quad \text{a.s.}$$

$$\liminf_{n \to \infty} \frac{M(n)}{\sqrt{n \log\log n}} = -\sqrt{2 \rho(i^m, j)} \quad \text{a.s.}$$

From here minor algebraic computations give the exact asymptotics. The rate of convergence follows immediately.

□ 

The following theorem will play a central role in the next section (for the order estimation problem).


Theorem  3.3.2 

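Let $X_t$ be as in Theorem 3.3.1. Then:

$$\frac{1}{n} \log P_{\hat\theta_n}(X_1^n) = \frac{1}{n} \log P_{\theta^*}(X_1^n) + O_{a.s.}\!\left(\frac{\log\log n}{n}\right) \tag{3.9}$$

Moreover the following bounds on the asymptotics hold:

$$\limsup_{n \to \infty}\, (\log\log n)^{-1} \left(\log P_{\hat\theta_n}(X_1^n) - \log P_{\theta^*}(X_1^n)\right) \le C_\eta\, q^m (q-1) \tag{3.10}$$

$$\liminf_{n \to \infty}\, (\log\log n)^{-1} \left(\log P_{\hat\theta_n}(X_1^n) - \log P_{\theta^*}(X_1^n)\right) \ge 0 \tag{3.11}$$

where $C_\eta := \frac{2q}{\eta}$ and $\eta := \min_{i^m, j}\{a^*(i^m, j)\}$.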

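Proof: Expanding in Taylor's series:

$$\frac{1}{n} \log P_{\hat\theta_n} - \frac{1}{n} \log P_{\theta^*} = \frac{1}{n} \frac{\partial}{\partial\theta} \log P_\theta \Big|_{\theta^*} (\hat\theta_n - \theta^*) + \frac{1}{2} (\hat\theta_n - \theta^*)^T\, \frac{1}{n} \frac{\partial^2 \log P_\theta}{\partial\theta^2} \Big|_{\theta^*} (\hat\theta_n - \theta^*) + r_n$$

where $r_n = o(\|\hat\theta_n - \theta^*\|^2) = o\!\left(\frac{\log\log n}{n}\right)$.

The derivative with respect to $a(i^m, j)$ is (see (3.4)):

$$\frac{\partial}{\partial a(i^m, j)} \log P_\theta \Big|_{\theta^*} = \frac{\partial}{\partial a(i^m, j)} \log \pi(x_1^m) \Big|_{\theta^*} + \left[\frac{N(i^m, j, n)}{a^*(i^m, j)} - \frac{N(i^m, q, n)}{a^*(i^m, q)}\right]$$

Asymptotically, after normalization by $\frac{1}{n}$, the first term is negligible. Therefore:

$$\frac{\partial}{\partial a(i^m, j)} \log P_\theta \Big|_{\theta^*} \approx \frac{N(i^m, j, n)}{a^*(i^m, j)} - \frac{N(i^m, q, n)}{a^*(i^m, q)}$$

The scalar product of the derivative and $\hat\theta_n - \theta^*$ is given by (we drop the dependence on $n$ in the $N$'s and write $\hat a(i^m, j)$ for $N(i^m, j)/N(i^m)$):

$$\sum N(i^m)\, \frac{\hat a(i^m, j)}{a^*(i^m, j)} \left(\hat a(i^m, j) - a^*(i^m, j)\right) - \sum N(i^m)\, \frac{\hat a(i^m, q)}{a^*(i^m, q)} \left(\hat a(i^m, j) - a^*(i^m, j)\right)$$

where the sums must be taken over all $i^m \in \mathcal{X}^m$ and $j \le q-1$. Sum over $j$ the second $\sum$, using $\sum_{j \le q-1} (\hat a(i^m, j) - a^*(i^m, j)) = -(\hat a(i^m, q) - a^*(i^m, q))$, and add to the first to get:

$$\sum N(i^m)\, \frac{\hat a(i^m, j)}{a^*(i^m, j)} \left(\hat a(i^m, j) - a^*(i^m, j)\right) = \sum N(i^m)\, \frac{\left(\hat a(i^m, j) - a^*(i^m, j)\right)^2}{a^*(i^m, j)}$$

where the sum is now extended over all $(i^m, j) \in \mathcal{X}^{m+1}$; the last equality holds because $\sum_{j} (\hat a(i^m, j) - a^*(i^m, j)) = 0$ for every $i^m$.

From Theorem 3.3.1 and the trivial inequality $\limsup(\sum) \le \sum \limsup$ we have:

$$\limsup_{n}\, (\log\log n)^{-1}\, \frac{\partial}{\partial\theta} \log P_\theta \Big|_{\theta^*} (\hat\theta_n - \theta^*) \le \sum_{i^m, j} \limsup_{n} \left(\frac{\log\log n}{n}\right)^{-1} \frac{N(i^m)}{n}\, \frac{\left(\hat a(i^m, j) - a^*(i^m, j)\right)^2}{a^*(i^m, j)} \tag{*}$$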

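Theorem 3.3.1 applies directly to the terms with index $j \le q-1$, but terms with $j = q$ must be dealt with separately. We do the latter first.

$$\hat a(i^m, q) - a^*(i^m, q) = -\sum_{j \le q-1} \left(\hat a(i^m, j) - a^*(i^m, j)\right)$$

and therefore:

$$\left(\hat a(i^m, q) - a^*(i^m, q)\right)^2 = \Big(\sum_{j \le q-1} \left(\hat a(i^m, j) - a^*(i^m, j)\right)\Big)^2 \le (q-1) \sum_{j \le q-1} \left(\hat a(i^m, j) - a^*(i^m, j)\right)^2$$

The total contribution to (*) from the terms with $j = q$ is therefore:

$$\sum_{i^m} \limsup_{n} \left(\frac{\log\log n}{n}\right)^{-1} \frac{N(i^m)}{n}\, \frac{1}{a^*(i^m, q)} \left(\hat a(i^m, q) - a^*(i^m, q)\right)^2$$

$$\le (q-1) \sum_{i^m} \sum_{j \le q-1} \limsup_{n} \left(\frac{\log\log n}{n}\right)^{-1} \frac{N(i^m)}{n}\, \frac{1}{a^*(i^m, q)} \left(\hat a(i^m, j) - a^*(i^m, j)\right)^2$$

$$\le (q-1) \sum_{i^m} \sum_{j \le q-1} \frac{\pi^*(i^m)}{a^*(i^m, q)}\, \frac{2\, a^*(i^m, j)(1 - a^*(i^m, j))}{\pi^*(i^m)} \tag{3.12}$$

On the other hand the contribution to (*) from the terms with $j \le q-1$ is upper bounded by (from Theorem 3.3.1 directly):

$$\sum_{i^m} \sum_{j \le q-1} \frac{\pi^*(i^m)}{a^*(i^m, j)}\, \frac{2\, a^*(i^m, j)(1 - a^*(i^m, j))}{\pi^*(i^m)} = \sum_{i^m} \sum_{j \le q-1} 2\,(1 - a^*(i^m, j)) \tag{3.13}$$

Adding together (3.12) and (3.13) we get the final upper bound:

$$\sum_{i^m} \sum_{j \le q-1} \left[\frac{2(q-1)\, a^*(i^m, j)(1 - a^*(i^m, j))}{a^*(i^m, q)} + 2\,(1 - a^*(i^m, j))\right] \le \sum_{i^m} \sum_{j \le q-1} \left(\frac{2(q-1)}{a^*(i^m, q)} + 2\right)$$

With $\eta := \min\{a^*(i^m, q)\}$ and $\tilde C_\eta := 2\left(\frac{q-1}{\eta} + 1\right)$ we get the best possible bound, i.e. $\tilde C_\eta\, q^m (q-1)$. Taking $C_\eta := \frac{2q}{\eta} \ge \tilde C_\eta$ we get the looser but simpler bound given in the statement of the Theorem.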

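We now proceed to the direct evaluation of the quadratic term in Taylor's expansion. For the sake of readability notation will be kept to the bare essential. In particular we will drop the dependence on $m$ of the first index, i.e. $a(i^m, j)$ will be denoted $a_{i,j}$ etc.

We start by observing that:

$$\frac{1}{n} \frac{\partial^2}{\partial\theta^2} \log P_\theta \Big|_{\theta^*} = \frac{\partial^2}{\partial\theta^2} E_{\theta^*}(\log P_\theta) \Big|_{\theta^*} + o(1)$$

where:

$$E_{\theta^*}(\log P_\theta) = \sum_{i} \Big(\sum_{j \le q-1} \pi^*_i\, a^*_{i,j} \log a_{i,j} + \pi^*_i\, a^*_{i,q} \log a_{i,q}\Big)$$

The first derivatives are:

$$\frac{\partial}{\partial a_{h,k}} E_{\theta^*}(\log P_\theta) = a^*_{h,k}\, \pi^*_h\, \frac{1}{a_{h,k}} - a^*_{h,q}\, \pi^*_h\, \frac{1}{a_{h,q}}$$

The second derivatives:

$$\frac{\partial^2}{\partial a_{h,k}\, \partial a_{\ell,m}} E_{\theta^*}(\log P_\theta) = -\pi^*_h\, a^*_{h,k}\, \frac{1}{a_{h,k}^2}\, \delta_{(h,k)(\ell,m)} - \pi^*_h\, a^*_{h,q}\, \frac{1}{a_{h,q}^2}\, \delta_{h,\ell}$$

The $\delta$'s are Kronecker symbols. Evaluated at $\theta^*$:

$$\frac{\partial^2}{\partial a_{h,k}\, \partial a_{\ell,m}} E_{\theta^*}(\log P_\theta) \Big|_{\theta^*} = -\frac{\pi^*_h}{a^*_{h,k}}\, \delta_{(h,k)(\ell,m)} - \frac{\pi^*_h}{a^*_{h,q}}\, \delta_{h,\ell}$$

The quadratic form is now

$$(\hat\theta_n - \theta^*)^T\, \frac{\partial^2}{\partial\theta^2} E_{\theta^*}(\log P_\theta) \Big|_{\theta^*} (\hat\theta_n - \theta^*) = -\sum_{(h,k),\, k \le q-1}\ \sum_{(\ell,m),\, m \le q-1} (\hat a_{h,k} - a^*_{h,k})(\hat a_{\ell,m} - a^*_{\ell,m}) \left[\frac{\pi^*_h}{a^*_{h,k}}\, \delta_{(h,k)(\ell,m)} + \frac{\pi^*_h}{a^*_{h,q}}\, \delta_{h,\ell}\right]$$

Some cumbersome algebraic manipulations give the final form:

$$= -\sum_{(h,k)} (\hat a_{h,k} - a^*_{h,k})^2\, \frac{\pi^*_h}{a^*_{h,k}}$$

where the sum now runs over all pairs $(h,k)$, including $k = q$. The quadratic form is negative definite, as was to be expected. Obviously:

$$\limsup_{n}\, (\log\log n)^{-1}\, (\hat\theta_n - \theta^*)^T\, \frac{\partial^2 \log P_\theta}{\partial\theta^2} \Big|_{\theta^*} (\hat\theta_n - \theta^*) \le 0$$

and therefore this term does not influence the global upper bound. For the lower bound (3.11) observe that by definition:

$$P_{\hat\theta_n}(X_1^n) \ge P_{\theta^*}(X_1^n)$$

from this (3.11) follows trivially.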
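The lower bound rests on the fact that the maximized likelihood can never fall below the likelihood at the true parameter. A minimal numerical sketch of this fact, ignoring the initial-probability term of (3.4): the log-likelihood ratio is then $\sum N(i^m,j,n) \log(\hat a(i^m,j,n)/a^*(i^m,j))$, which is nonnegative for any reference parameter. The function name `log_likelihood_ratio` and the dictionary layout are assumptions made for this illustration, and the reference probabilities must be strictly positive on the observed pairs.

```python
import math


def log_likelihood_ratio(pair_counts, a_ref):
    """Sum over observed (context, next) pairs of N * log(a_hat / a_ref).

    pair_counts[(c, j)] = N(c, j, n); a_ref[(c, j)] = reference transition
    probability, assumed strictly positive for every observed pair.
    """
    ctx = {}
    for (c, _), k in pair_counts.items():
        ctx[c] = ctx.get(c, 0) + k  # N(c, n) as the sum of N(c, j, n) over j
    total = 0.0
    for (c, j), k in pair_counts.items():
        a_hat = k / ctx[c]
        total += k * math.log(a_hat / a_ref[(c, j)])
    return total


# Counts for a binary chain of memory 1 and a uniform reference parameter.
counts_example = {((0,), 0): 1, ((0,), 1): 4, ((1,), 0): 3, ((1,), 1): 1}
uniform_ref = {key: 0.5 for key in counts_example}
llr_example = log_likelihood_ratio(counts_example, uniform_ref)
```

The ratio vanishes exactly when the reference parameter coincides with the MLE on the observed pairs.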


□ 


3.4  Estimation  of  the  order 

After the publication of [17] we thought the results of this section would lose some of their interest, but after careful study we decided to present them for two reasons. We have been unable to convince ourselves of the validity of some of the arguments given in [17]. Moreover our results do not intersect those given in [17] and are obtained by a totally different method. The problem can be roughly posed as follows. We observe the process $X_t$ which is known to be a finite Markov chain of order $m^*$; the transition probabilities of $X_t$ and the order $m^*$ are unknown. Our goal: construct a consistent estimator of $m^*$. To formulate the problem correctly we must first define the order of a finite order chain.

Definition  3.4.1 

i) The order of a finite order Markov chain $X_t$ is the minimum $m$ satisfying Definition 3.3.1.

ii) A representation of a finite order Markov chain of order $m$ is any set of transition probabilities and stationary initial probabilities:

$$\{a(i^{m'}, j),\ \pi(i^{m'});\ (i^{m'}, j) \in \mathcal{X}^{m'+1}\}$$

with $m' \ge m$ that generate the probability distribution of $X_t$. $m'$ will be called the memory of the representation.

iii) A minimal representation is a representation whose memory equals the order.

Remark:  The  notions  of  order  and  minimality  introduced  here  do  not  coincide 
with  those  of  System  Theory.  Roughly  said,  in  System  Theory  the  order  is  the 
cardinality  of  the  smallest  state  space  that  allows  a  description  of  the  process. 
If  Xt  is  a  stationary  (standard)  Markov  chain  then  its  system  theoretical  order 
would  be  the  cardinality  of  the  set  of  ergodic  states,  because  no  transient  state 
could ever be observed from any trajectory of the stationary chain (the invariant probabilities associated to transient states are all zero). In our definition the state space $\mathcal{X}$ is fixed in advance.

Theorem  3.4.1 

Let $X_t$ be an m-th order Markov chain,

$$M := \{a(i^m, j),\ \pi(i^m)\}$$

a minimal representation of $X_t$, and $Y_t$ the m-th derived process:

i) There is only one minimal representation if and only if $Y_t$ is ergodic.

ii) For all $m' > m$ one representation is given by:

$$M' := \{a'(i^{m'}, j),\ \pi'(i^{m'})\}$$

where

$$a'(i^{m'}, j) = a(i_{m'-m+1}^{m'}, j)$$

$$\pi'(i^{m'}) = \pi(i^m)\, a(i^m, i_{m+1}) \cdots a(i_{m'-m}^{m'-1}, i_{m'})$$

iii) The m'-th derived process $Y_t$ is ergodic for all $m' > m$ if and only if $a(i^m, j) > 0$ for all $(i^m, j)$.

iv) For any $m' > m$ there is only one representation with memory $m'$ if and only if $a(i^m, j) > 0$ for all $(i^m, j)$.

Proof: i) If $Y_t$ is ergodic then from the general fact that two distinct stationary ergodic processes are mutually singular we conclude that $M$ is unique. On the other hand if $Y_t$ is not ergodic at least one of the stationary probabilities $\pi(i^m) = 0$ and therefore the corresponding "row" of t.p.'s $a(i^m, j)$ can be changed without altering the probability distribution and we may therefore construct infinitely many representations equivalent to $M$.

ii) For $m' > m$ we have:

$$a'(i^{m'}, j) := P(X_{m'+1} = j \mid X_1^{m'} = i^{m'}) = P(X_{m'+1} = j \mid X_{m'-m+1}^{m'} = i_{m'-m+1}^{m'}) = a(i_{m'-m+1}^{m'}, j)$$

$$\pi'(i^{m'}) := P(X_1^{m'} = i^{m'}) = P(X_{m+1}^{m'} = i_{m+1}^{m'} \mid X_1^m = i^m)\, P(X_1^m = i^m) = \pi(i^m)\, a(i^m, i_{m+1})\, a(i_2^{m+1}, i_{m+2}) \cdots a(i_{m'-m}^{m'-1}, i_{m'})$$

It follows from the definition that these are the t.p.'s associated to the m'-th derived process $Y_t = (X_t, \dots, X_{t+m'-1})$ when $X_t$ is generated by $M$.

iii) If $a(i^m, j) > 0\ \forall\, i^m, j$ then for all $m' > m$ the chain can move between any two sequences of states, $x_1^{m'}$ and $\tilde x_1^{m'}$, in at most $m' + 1$ steps. This is more than required for the ergodicity of the m'-th derived process for all $m' > m$.

On the other hand if $a(i^m, j) = 0$ for some $(i^m, j)$ then the t.p. matrix of the (m+1)-th derived process has a "column" of zeros and is therefore not irreducible. It follows that the (m+1)-th derived process is not ergodic.

iv) If $a(i^m, j) > 0\ \forall\, i^m, j$ then for any $m' > m$ the m'-th derived process $Y_t$ is an ergodic Markov chain of order 1. By i) $Y_t$ has a unique minimal representation which must therefore coincide with $M'$ constructed in ii). If on the other hand $a(i^m, j) = 0$ for some $(i^m, j)$ then the (m+1)-th derived process is non ergodic, as proved in iii), and therefore its representation of memory $m + 1$ is non-unique, again by i).
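The construction of part ii), $a'(i^{m'}, j) = a(i_{m'-m+1}^{m'}, j)$ and $\pi'(i^{m'}) = \pi(i^m)\, a(i^m, i_{m+1}) \cdots a(i_{m'-m}^{m'-1}, i_{m'})$, can be illustrated with a short sketch. The function name `lift_representation`, the 0-based state coding and the dictionary layout are assumptions made for the example.

```python
from itertools import product


def lift_representation(a, pi, q, m, m_prime):
    """Build a representation (a', pi') of memory m_prime >= m from (a, pi)
    of memory m.

    a[(c, j)] is the transition probability from context c (tuple of m
    states in 0..q-1) to state j; pi[c] is the stationary probability of c.
    """
    a_lift, pi_lift = {}, {}
    for c in product(range(q), repeat=m_prime):
        for j in range(q):
            # only the last m states of the longer context matter
            a_lift[(c, j)] = a[(c[m_prime - m:], j)]
        p = pi[c[:m]]
        for k in range(m, m_prime):
            # chain rule along the m_prime - m additional states
            p *= a[(c[k - m:k], c[k])]
        pi_lift[c] = p
    return a_lift, pi_lift


# Binary chain of order 1 with invariant probabilities (0.8, 0.2).
a1 = {((0,), 0): 0.9, ((0,), 1): 0.1, ((1,), 0): 0.4, ((1,), 1): 0.6}
pi1 = {(0,): 0.8, (1,): 0.2}
a3, pi3 = lift_representation(a1, pi1, q=2, m=1, m_prime=3)
```

The lifted initial probabilities `pi3` again sum to one, and each lifted transition probability depends on its context only through the last $m$ states.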


For the proof of the consistency of the order estimation procedure we will need the ergodicity of the m'-th derived processes for all $m' \ge m$ so that the SLLN will be valid for all $m' \ge m$. This fact, in view of Theorem 3.4.1, justifies the following assumption:

Assumption SP:

The observed process is a finite Markov chain, taking values in $\mathcal{X} = \{1, 2, \dots, q\}$, of unknown order $m^*$ and unknown strictly positive transition probabilities $\{a^*(i^{m^*}, j)\}$.

And now we can formulate our:

Problem:

Let $X_t$ be a process satisfying assumption SP. From the observation of an arbitrarily large initial segment of one trajectory of $X_t$ construct a strongly consistent estimator of the unknown order $m^*$.

The most natural parametric model for the process $X_t$ is given by the following:


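Definition 3.4.2

$$\Theta_m := \{\text{all possible } a(i^m, j) > 0;\ i^m \in \mathcal{X}^m,\ j = 1, 2, \dots, q-1\}$$

$$\Theta := \bigcup_{m \ge 0} \Theta_m$$

The results of Theorem 3.4.1 now tell us that $\Theta_m$ contains no representation of $X_t$ if $m < m^*$ and only one representation of $X_t$ for any $m \ge m^*$.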

Definition 3.4.3

The compensated maximum log-likelihood is defined as:

$$C(m, n) := -L_n(\hat\theta_m(n)) + \delta_n(m)$$

where:

$\hat\theta_m(n)$ is the ML estimator of $\theta \in \Theta_m$ based on $n$ observations

$L_n(\hat\theta_m(n)) := \frac{1}{n} \log P_{\hat\theta_m(n)}(X_1^n)$

$\delta_n(m)$ is a positive, increasing function of $m$ to be specified.

Definition 3.4.4

$$\hat m(n) := \min\{\arg\min_m C(m, n)\}$$

We will show how to choose the functions $\delta_n(m)$ to guarantee the strong consistency of $\hat m(n)$.

The idea of studying the compensated likelihood and the technique for the choice of the $\delta_n(m)$ functions follow Azencott, Dacunha-Castelle [1]. Obviously our main result (Theorem 3.3.2) and its technique of proof are totally different.

Theorem 3.4.2 Compensators Avoiding Underestimation

If $\lim_n \delta_n(m) = 0\ \forall\, m$ then:

$$\liminf_n \hat m(n) \ge m^* \quad P_{\theta^*}\text{-a.s.}$$

Proof: Define $L_n(\theta) := \frac{1}{n} \log P_\theta(X_1^n)$. From Lemma 2.4.1 we have:

$$L_n(\theta^*) - L_n(\theta) \to D(P_{\theta^*} \,\|\, P_\theta) \quad \text{a.s. and uniformly in } \theta \in \Theta_m\ \forall\, m.$$

Therefore for $m < m^*$:

$$\inf_{\theta \in \Theta_m} [L_n(\theta^*) - L_n(\theta)] = L_n(\theta^*) - L_n(\hat\theta_m(n)) \to \inf_{\theta \in \Theta_m} D(P_{\theta^*} \,\|\, P_\theta) =: \gamma > 0 \tag{3.14}$$

($\gamma > 0$ since no point in $\Theta_m$ is equivalent to $X_t$ for $m < m^*$).

Theorem 3.3.1 shows that $\hat\theta_{m^*}(n) \to \theta^*$ a.s. which implies:

$$\lim_n [L_n(\theta^*) - L_n(\hat\theta_{m^*}(n))] = 0 \quad \text{a.s.} \tag{3.15}$$

From (3.14) and (3.15)

$$\lim_n [L_n(\hat\theta_{m^*}(n)) - L_n(\hat\theta_m(n))] = \gamma > 0 \tag{3.16}$$

By the definition of $C(m, n)$ and the condition $\lim_n \delta_n(m) = 0$, (3.16) gives:

$$\lim_n [C(m, n) - C(m^*, n)] = \gamma > 0$$

for any $m < m^*$. On the other hand $[C(\hat m(n), n) - C(m^*, n)] \le 0$ by definition of $\hat m(n)$ and therefore we conclude that all the limiting values of $\hat m(n)$ must be $\ge m^*$, i.e. $\liminf_n \hat m(n) \ge m^*$ as claimed.


□ 


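We now study the conditions to be imposed on the function $\delta_n(m)$ to avoid overestimation. Theorem 3.3.2 will be our main tool.

Lemma 3.4.1

For any $m \ge m^*$ we have:

$$\limsup_n \left(\frac{\log\log n}{n}\right)^{-1} [L_n(\hat\theta_m(n)) - L_n(\hat\theta_{m^*}(n))] \le C_\eta\, q^m (q-1)$$

Proof:

$$\limsup_n \left(\frac{\log\log n}{n}\right)^{-1} [L_n(\hat\theta_m(n)) - L_n(\hat\theta_{m^*}(n))]$$

$$\le \limsup_n \left(\frac{\log\log n}{n}\right)^{-1} [L_n(\hat\theta_m(n)) - L_n(\theta^*)] - \liminf_n \left(\frac{\log\log n}{n}\right)^{-1} [L_n(\hat\theta_{m^*}(n)) - L_n(\theta^*)]$$

$$\le C_\eta\, q^m (q-1)$$

To bound $\limsup$ and $\liminf$ we used Theorem 3.3.2 which is valid for any $m \ge m^*$.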


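Theorem 3.4.3 Compensators Avoiding Overestimation

If the compensator is of the form

$$\delta_n(m) := \varphi(n)\, h(m)$$

where the function $\varphi$ satisfies:

$$\liminf_n \left(\frac{\log\log n}{n}\right)^{-1} \varphi(n) > 1$$

and the function $h$ satisfies:

$$h(m') - h(m) \ge C_\eta\, q^{m'} (q-1) \quad \text{for all } m' > m \ge 0.$$

Then:

$$\limsup_n \hat m(n) \le m^* \quad P_{\theta^*}\text{-a.s.}$$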
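Proof: We now assume $m > m^*$, and from the form of $\delta_n(m)$ we have:

$$C(m^*, n) - C(m, n) = L_n(\hat\theta_m(n)) - L_n(\hat\theta_{m^*}(n)) + \varphi(n)\,[h(m^*) - h(m)]$$

Therefore:

$$\limsup_n \left(\frac{\log\log n}{n}\right)^{-1} (C(m^*, n) - C(m, n))$$

$$\le \limsup_n \left(\frac{\log\log n}{n}\right)^{-1} [L_n(\hat\theta_m(n)) - L_n(\hat\theta_{m^*}(n))] + \limsup_n \left(\frac{\log\log n}{n}\right)^{-1} \varphi(n)\,[h(m^*) - h(m)]$$

The first term is bounded by (Lemma 3.4.1) $C_\eta\, q^m (q-1)$.

For the second term observe that by hypothesis:

$$h(m^*) - h(m) \le -C_\eta\, q^m (q-1) \quad (\text{since } m > m^*)$$

Substitution gives:

$$\limsup_n \left(\frac{\log\log n}{n}\right)^{-1} \varphi(n)\,[h(m^*) - h(m)] \le \limsup_n \left(\frac{\log\log n}{n}\right)^{-1} \varphi(n)\, (-C_\eta\, q^m (q-1))$$

And therefore:

$$\limsup_n \left(\frac{\log\log n}{n}\right)^{-1} (C(m^*, n) - C(m, n)) \le C_\eta\, q^m (q-1) \left[1 - \liminf_n \left(\frac{\log\log n}{n}\right)^{-1} \varphi(n)\right]$$

The hypothesis on $\varphi$ makes the bracket strictly negative. We conclude that $\forall\, m > m^*$:

$$\limsup_n \left(\frac{\log\log n}{n}\right)^{-1} (C(m^*, n) - C(m, n)) < 0$$

On the other hand, by definition of $\hat m(n)$, $C(m^*, n) - C(\hat m(n), n) \ge 0$, thus the conclusion $\limsup_n \hat m(n) \le m^*$.

□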
We must now show that functions $h$ and $\varphi$ satisfying the conditions imposed by Theorems 3.4.2 and 3.4.3 do exist. The $h$ function is substantially different from the $h$ function of [1].

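Example of h Function

$$h(m) := C_\eta\, (q-1)\, \frac{(q+1)^{m+1}}{q}$$

We must check that for all $0 \le m < m'$

$$h(m') - h(m) \ge C_\eta\, (q-1)\, q^{m'}$$

But:

$$h(m') - h(m) = C_\eta\, (q-1) \left[\frac{(q+1)^{m'+1}}{q} - \frac{(q+1)^{m+1}}{q}\right]$$

The condition is satisfied if:

$$\frac{(q+1)^{m'+1}}{q} - \frac{(q+1)^{m+1}}{q} \ge q^{m'} \quad \text{for all } 0 \le m < m'$$

Dividing both sides by $q^{m'}$ we get:

$$\left(\frac{q+1}{q}\right)^{m'+1} - \frac{(q+1)^{m+1}}{q^{m'+1}} \ge 1$$

$$\left(\frac{q+1}{q}\right)^{m'+1} \left[1 - \frac{(q+1)^{m}}{(q+1)^{m'}}\right] \ge 1$$

Since $m' > m$ the term in brackets is greater than or equal to $\left(1 - \frac{1}{q+1}\right)$. The inequality is therefore satisfied if:

$$\left(\frac{q+1}{q}\right)^{m'+1} \left(\frac{q}{q+1}\right) \ge 1$$

i.e.

$$\left(\frac{q+1}{q}\right)^{m'} \ge 1 \quad \forall\, m' \ge 0$$

which is trivially satisfied.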
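The increment condition can also be checked numerically for small values. The sketch below assumes the form $h(m) = C_\eta (q-1)(q+1)^{m+1}/q$ for the example function and tests $h(m') - h(m) \ge C_\eta (q-1) q^{m'}$ over a finite grid; the function names and the numerical values of $C_\eta$ are arbitrary illustration choices, since $C_\eta$ depends on the unknown true distribution.

```python
def h(m, q, c_eta):
    """Candidate profile h(m) = C_eta * (q - 1) * (q + 1)**(m + 1) / q."""
    return c_eta * (q - 1) * (q + 1) ** (m + 1) / q


def increments_ok(q, c_eta, max_m):
    """Check h(m') - h(m) >= C_eta * (q - 1) * q**m' for all 0 <= m < m' <= max_m."""
    return all(
        h(mp, q, c_eta) - h(m, q, c_eta) >= c_eta * (q - 1) * q ** mp
        for mp in range(1, max_m + 1)
        for m in range(mp)
    )


# Finite check for a binary and a five-letter alphabet.
binary_ok = increments_ok(q=2, c_eta=4.0, max_m=8)
five_ok = increments_ok(q=5, c_eta=1.0, max_m=6)
```

A finite check of this kind does not replace the argument above; it only guards against transcription slips in the formula for $h$.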




Example of φ Function

$\varphi(n)$ satisfies both Theorems 3.4.2 and 3.4.3 if it is taken as:

$$\varphi(n) := \frac{\log\log n}{n}\,(1 + \varepsilon) \quad \text{for some } \varepsilon > 0$$

□ 

The reader who has patiently followed us may object that Theorem 3.4.3 is useless from a practical point of view, since $C_\eta$ depends on the true distribution. Theorem 3.4.4 will reassure him or her on the practicability of our approach, but first let us observe that in the ARMA case the $h$ function does not depend on the true distribution (see [1]). We are investigating the reasons for this discrepancy.

Theorem 3.4.4 Consistent Estimators

The compensator:

$$\delta_n(m) := \frac{\log n}{n}\, h(m)$$

where $h(m)$ is any strictly increasing function of $m$, produces a strongly consistent estimator of $m^*$.

Proof: $\delta_n(m) \to 0$ for all $m$ and therefore it avoids underparametrization as proved in Theorem 3.4.2.

For the case of overparametrization we reason as in the proof of Theorem 3.4.3. Assume $m > m^*$.

$$\limsup_n \left(\frac{\log n}{n}\right)^{-1} [C(m^*, n) - C(m, n)] \le \limsup_n \left(\frac{\log n}{n}\right)^{-1} [L_n(\hat\theta_m(n)) - L_n(\hat\theta_{m^*}(n))] + (h(m^*) - h(m))$$

Since the difference $L_n(\hat\theta_m(n)) - L_n(\hat\theta_{m^*}(n)) = O_{a.s.}\!\left(\frac{\log\log n}{n}\right)$, and by hypothesis $h(m^*) - h(m) < 0$, we get for $m > m^*$

$$\limsup_n \left(\frac{\log n}{n}\right)^{-1} [C(m^*, n) - C(m, n)] \le h(m^*) - h(m) < 0.$$

On the other hand $C(m^*, n) - C(\hat m(n), n) \ge 0$, by definition of $\hat m(n)$, and therefore $\limsup_n \hat m(n) \le m^*$, i.e. $\hat m$ avoids overparametrization too.
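A minimal sketch of an order estimator with the compensator $\delta_n(m) = \frac{\log n}{n} h(m)$ follows. It makes several simplifying assumptions that are not part of the theorem: the search is restricted to $m \le$ `max_m`, the initial-probability term of the likelihood is ignored, the likelihood for memory $m$ uses only the transitions from position $m$ onwards, and $h(m) = m + 1$ is just one arbitrary strictly increasing choice. The function names are hypothetical.

```python
import math
from collections import Counter


def max_log_likelihood(x, m):
    """Maximized log-likelihood of the transitions of x for memory m,
    without the initial-probability term."""
    pair, ctx = Counter(), Counter()
    for t in range(m, len(x)):
        c = tuple(x[t - m:t])
        pair[(c, x[t])] += 1
        ctx[c] += 1
    return sum(k * math.log(k / ctx[c]) for (c, _), k in pair.items())


def estimate_order(x, max_m, h=lambda m: m + 1):
    """Smallest minimizer over m = 0..max_m of
    C(m, n) = -(1/n) * max log-likelihood + (log n / n) * h(m)."""
    n = len(x)
    scores = [
        (-max_log_likelihood(x, m) + math.log(n) * h(m)) / n
        for m in range(max_m + 1)
    ]
    # min() returns the first index attaining the minimum, i.e. the smallest minimizer
    return min(range(max_m + 1), key=lambda m: scores[m])


# Two deterministic toy trajectories whose next symbol is fixed by the
# previous one and by the previous two symbols respectively.
order_alternating = estimate_order([0, 1] * 100, max_m=3)
order_period_three = estimate_order([0, 0, 1] * 67, max_m=3)
```

The toy trajectories violate the strict positivity required by assumption SP, so they only exercise the mechanics of the compensated likelihood; they say nothing about the consistency result itself.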
Chapter 4

Estimation  of  the  Order  of  a 
Hidden  Markov  Chain 


The technique that was employed in Chapter 3 for the estimation of the order of a Markov chain will now be adapted to the estimation of the order of a HMC. As we have seen in the Markov case, the crucial step is the evaluation of the rate of growth of the maximized likelihood ratio (MLR). For Markov chains we evaluated this rate to be $O_{a.s.}(\log\log n)$ (Theorem 3.3.2) and we also had very precise results for the $\limsup$ and the $\liminf$ of the MLR. For HMC's we will be able to get the rate $O_{a.s.}(\log\log n)$ only in special cases. For the general case we get $O_{a.s.}(\log n)$.

At first the problem of estimating the rate of the MLR for HMC seems easy to solve. For any $\theta$ write: $P_\theta(y_1^n) = \sum_{x_1^n} P_\theta(y_1^n, x_1^n) = \sum_{x_1^n} P_\theta(s_1^n)$ where the process $S_t = (X_t, Y_t)$ is a Markov chain.

Clearly $\max_\theta P_\theta(y_1^n) \le \sum_{x_1^n} \max_\theta P_\theta(s_1^n)$.

Since $S_t$ is a Markov chain we know from Theorem 3.3.2 that:

$$\frac{\max_\theta P_\theta(s_1^n)}{P_{\theta_0}(s_1^n)} = e^{a_n}$$

where $a_n = O_{a.s.}(\log\log n)$.

Substituting in the previous inequality we find:

$$\max_\theta P_\theta(y_1^n) \le \sum_{x_1^n} e^{a_n} P_{\theta_0}(s_1^n) = e^{a_n} \sum_{x_1^n} P_{\theta_0}(s_1^n) = e^{a_n} P_{\theta_0}(y_1^n)$$

From this we immediately get the desired rate.

$$\log \frac{\max_\theta P_\theta(y_1^n)}{P_{\theta_0}(y_1^n)} = O_{a.s.}(\log\log n)$$

This idea, or variations of it, has appeared in the literature, but unfortunately it is wrong. The problem is that Theorem 3.3.2 does not say that $a_n = O_{a.s.}(\log\log n)$ uniformly with respect to the realization $\omega$.

In Section 4.1 we pose the problem of the order estimation for HMC's and prove the analog of Theorem 3.4.2 on estimators that avoid underestimation. In Section 4.2 the MLR is studied for the true order. In this case we get the $O_{a.s.}(\log\log n)$ rate of growth. In Section 4.3 we approximate the MLR using Markov chains of finite memory and get a rather weak bound on the rate of growth. This result is more of theoretical than practical interest because the bound depends strongly on the true distribution. In Section 4.4 we state a result from Information Theory and use it to find the $O_{a.s.}(\log n)$ bound on the rate of the MLR. The final Section 4.5 is dedicated to the construction of strongly consistent estimators of the order.


4.1  Preliminaries 

In Section 1.2 we defined the order of a HMC $Y_t$ as the minimum integer $q$ for which there exists a representation of $Y_t$ with $|\mathcal{X}| = q$. In analogy with Section 3.4 we would like to construct a consistent estimator of the order based on the compensated maximum likelihood. The HMC case is complicated by the fact that our knowledge of the set of equivalent representations is only partial (see Sections 1.3 and 1.4). To cope with this difficulty we have to impose restrictions on the observed process $Y_t$ thus limiting the applicability of the results. Fortunately all of the assumptions are satisfied by a generic HMC and therefore the results are still widely applicable.

Assumption SP':

The observed process $Y_t$ is a HMC taking values in $\{1, 2, \dots, r\}$, of unknown order $q_0$. One representation of $Y_t$ is given by $\theta_0 = \{q_0, A_0, B_0\}$ where $\theta_0$ is a Petrie point of $\Theta^\delta_{q_0}$ for some $\delta > 0$.

(Petrie's points are defined in 1.4.1 and $\Theta^\delta_{q_0}$ in 1.4.2).
The class of parametric models that will be used is

$$\Theta := \bigcup_{q \ge 1} \Theta^\delta_q.$$
The results of Sections 1.2, 1.3 and 1.4 guarantee that $\Theta^\delta_q$ contains no point equivalent to $\theta_0$ if $q < q_0$ and a finite number of points equivalent to $\theta_0$ if $q = q_0$. For $q > q_0$ there are infinitely many points in $\Theta^\delta_q$ equivalent to $\theta_0$, as can easily be seen applying Lemma 1.3.1. In analogy with Section 3.4 the compensated maximum log-likelihood is defined as:

$$C(q, n) := -L_n(\hat\theta_q(n)) + \delta_n(q)$$

where:

$\hat\theta_q(n)$ is the MLE of $\theta \in \Theta^\delta_q$ based on $n$ observations

$L_n(\hat\theta_q(n)) := \frac{1}{n} \log P_{\hat\theta_q(n)}(Y_1^n)$

$\delta_n(q)$ is a positive increasing function of $q$ and $n$ to be determined.

The estimator of the order is defined by:

$$\hat q(n) := \min\{\arg\min_{q \ge 1} C(q, n)\}$$

The problem of order estimation can now be posed as follows.

Problem:

The HMC $Y_t$ satisfying assumption SP' is observed. Find a compensator sequence $\delta_n(q)$ such that the estimator $\hat q(n)$ is strongly consistent, i.e. $\hat q(n) \to q_0$ a.s.

The analog of Theorem 3.4.2 is valid and we can easily give a sufficient condition on $\delta_n(q)$ that avoids underestimation.

Theorem 4.1.1 Compensators avoiding underestimation

Let $Y_t$ be a process satisfying conditions SP'.

If $\lim_{n \to \infty} \delta_n(q) = 0$ ($\forall\, q$)

Then $\liminf_{n \to \infty} \hat q(n) \ge q_0$ $P_{\theta_0}$-a.s.

Proof: The proof is completely analogous to the proof of Theorem 3.4.2 and based on the essential fact that for $q < q_0$ there is no point in $\Theta^\delta_q$ equivalent to $\theta_0$. This last fact follows easily as a consequence of Lemma 1.3.2. Since $\theta_0 = \{q_0, A_0, B_0\}$ is regular any $\theta = \{q, A, B\}$ equivalent to it must have $q \ge q_0$.
Sufficient conditions on $\delta_n(q)$ to avoid overestimation are much more difficult to find. The crucial problem is to determine the a.s. rate of growth of:

$$\log \frac{P_{\hat\theta_q(n)}(Y_1^n)}{P_{\theta_0}(Y_1^n)}$$

In Section 3.4 we used the LIL to study the a.s. asymptotic behaviour of the ratio but, as will become apparent in the next section, for HMC's this technique works only for $q = q_0$. Theorem 4.5.1 will give sufficient conditions on $\delta_n(q)$ to avoid overestimation.

4.2 Rate of Convergence in $\Theta^\delta_{q_0}$

We study here the rate of growth of the maximized log-likelihood ratio (MLR)

$$\log \frac{P_{\hat\theta_{q_0}(n)}(Y_1^n)}{P_{\theta_0}(Y_1^n)}$$

Since $q_0$ is fixed, in this Section $\hat\theta_{q_0}(n)$ will be denoted $\hat\theta_n$. We need one extra assumption on the HMC $Y_t$ which will be in force throughout this section.

Assumption  PH: 

- 

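Recall that: $H_{\theta_0}(\theta) := E_{\theta_0}[\log P_\theta(Y_0 \mid Y_{-\infty}^{-1})]$.

After giving two preliminary results we will prove that the MLR is $O_{a.s.}(\log\log n)$. Remember that (Section 2.2):

$$\hat\Theta_n = \{\hat\theta \in \Theta^\delta_{q_0} ;\ P_{\hat\theta}(Y_1^n) = \max_\theta P_\theta(Y_1^n)\}$$

and that in general $\hat\Theta_n$ is not a singleton. Our first result shows that it is always possible to choose a convergent sequence $\hat\theta_n \in \hat\Theta_n$.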
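Lemma 4.2.1 There exists $N > 0$ finite and a sequence $\theta'_n \in \hat\theta_n$ $(\forall\, n > N)$ such
that $\theta'_n \to \theta_0$.

Proof: From Theorem 2.2.1 we know that $\hat\theta_n \to \mathcal{N} := \{\theta \in \Theta^{\delta}_{q_0} \; ; \; H_{\theta_0}(\theta) = H_{\theta_0}(\theta_0)\}$.
From assumption SP' and Theorem 1.4.1 it follows that $\mathcal{N}$ is a finite
subset of $\Theta^{\delta}_{q_0}$. In particular $\mathcal{N} = \{\theta \in \Theta^{\delta}_{q_0} \; ; \; \theta = \sigma(\theta_0)\}$ where $\sigma(\theta_0)$ denotes the
permutation of the matrices $(A_0, B_0)$ induced by the permutation $\sigma$ of the state
space $X$. As a consequence of these two facts we have that for any $\epsilon > 0$ there
exists an integer $N_\epsilon$ such that $\hat\theta_n \subset \mathcal{N}_\epsilon$ for all $n > N_\epsilon$. (Here $\mathcal{N}_\epsilon$ denotes the
fattened set). Since $\mathcal{N}$ is finite its points are isolated. Choose $\epsilon$ small enough for
$\mathcal{N}_\epsilon$ to be the union of disjoint balls of radius $\epsilon$ centered at the points of $\mathcal{N}$, i.e.

$$\mathcal{N}_\epsilon = \bigcup_{\sigma} B_\epsilon(\sigma(\theta_0)).$$

Let $N = N_\epsilon$.

To complete the proof it is enough to show that for all $n > N$ there exists a
point $\theta'_n \in B_\epsilon(\theta_0)$ such that $P_{\theta'_n}(y_1^n) = P_{\hat\theta_n}(y_1^n)$. Let $\theta \in \hat\theta_n$ and $n > N$. Then for
some $\sigma$ we have $\theta \in B_\epsilon(\sigma(\theta_0))$ and from the identity $\sigma(B_\epsilon(\theta)) = B_\epsilon(\sigma(\theta))$ we conclude
that $\sigma^{-1}(\theta) \in B_\epsilon(\theta_0)$. We can define $\theta'_n := \sigma^{-1}(\theta)$ and conclude
that $\theta'_n \to \theta_0$.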


From now on we will suppose that the choice of a sequence of points $\theta'_n \to \theta_0$
has already been made and with slight abuse of notation we will denote by $\hat\theta_n$ the
point $\theta'_n$ itself.

The  lemma  below  will  be  needed  for  the  application  of  the  LIL. 

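Lemma 4.2.2 For some finite $C$, $\forall k$, $\forall l$, $\forall \theta$:

$$\left| \frac{\partial}{\partial\theta_l} \log P_\theta(y_k \mid y_1^{k-1}) \right| \le C \quad \text{a.s. } P_{\theta_0}$$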


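Proof: In the statement $\theta_l$ can be any element of $\theta = (A, B)$. First we prove the
case $\theta_l := a_{ij}$ for some $(i,j)$ with $1 \le i \le q_0$ and $1 \le j \le q_0 - 1$. A direct
computation of the derivative gives:

$$\frac{\partial}{\partial a_{ij}} \log P_\theta(y_k \mid y_1^{k-1}) = \frac{\partial}{\partial a_{ij}} \log P_\theta(y_1^k) - \frac{\partial}{\partial a_{ij}} \log P_\theta(y_1^{k-1})$$

$$= \sum_{t=1}^{k-1} \left[ \frac{P_\theta(x_t = i, x_{t+1} = j \mid y_1^k)}{a_{ij}} - \frac{P_\theta(x_t = i, x_{t+1} = q_0 \mid y_1^k)}{a_{iq_0}} \right] - \sum_{t=1}^{k-2} \left[ \frac{P_\theta(x_t = i, x_{t+1} = j \mid y_1^{k-1})}{a_{ij}} - \frac{P_\theta(x_t = i, x_{t+1} = q_0 \mid y_1^{k-1})}{a_{iq_0}} \right]$$

$$= \sum_{t=1}^{k-2} \left[ \frac{P_\theta(x_t = i, x_{t+1} = j \mid y_1^k) - P_\theta(x_t = i, x_{t+1} = j \mid y_1^{k-1})}{a_{ij}} - \frac{P_\theta(x_t = i, x_{t+1} = q_0 \mid y_1^k) - P_\theta(x_t = i, x_{t+1} = q_0 \mid y_1^{k-1})}{a_{iq_0}} \right]$$

$$\quad + \frac{P_\theta(x_{k-1} = i, x_k = j \mid y_1^k)}{a_{ij}} - \frac{P_\theta(x_{k-1} = i, x_k = q_0 \mid y_1^k)}{a_{iq_0}}$$

From Lemma A.0.3 we get for some $\rho < 1$:

$$| P_\theta(x_t = i_1, x_{t+1} = i_2 \mid y_1^k) - P_\theta(x_t = i_1, x_{t+1} = i_2 \mid y_1^{k-1}) | \le \rho^{k-t-1}$$

From the triangle inequality and the bounds $\delta \le a_{ij} \le 1 - \delta$ we have:

$$\left| \frac{\partial}{\partial a_{ij}} \log P_\theta(y_k \mid y_1^{k-1}) \right| \le \frac{2}{\delta} \sum_{t=1}^{k-2} \rho^{k-t-1} + \frac{2}{\delta} \le \frac{2}{\delta} \left( \frac{1}{1-\rho} + 1 \right) := C.$$

The proof for $\theta_l = b_{js}$ for $1 \le j \le q_0$, $1 \le s \le r - 1$ is very similar and will be
omitted.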


We  are  now  ready  to  study  the  rate  of  convergence. 
Theorem 4.2.1

$$\frac{1}{n} \log P_{\hat\theta_n}(y_1^n) = \frac{1}{n} \log P_{\theta_0}(y_1^n) + O_{a.s.}\left( \frac{\log\log n}{n} \right)$$


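Proof: The proof is divided into three steps. In step 1 we use the LIL for martingales
to prove that $\frac{\partial}{\partial\theta} \log P_\theta(y_1^n)|_{\theta_0} = O_{a.s.}(\sqrt{n \log\log n})$. In step 2 we use a
Taylor's expansion and step 1 to prove that $\hat\theta_n - \theta_0 = O_{a.s.}(\sqrt{\log\log n / n})$. In
step 3 we use another Taylor's expansion and steps 1, 2 to conclude.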

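Step 1:

We first prove that $\frac{\partial}{\partial\theta} \log P_\theta(y_1^n)|_{\theta_0}$ is a martingale componentwise. For
the generic component $\theta_l$ we have:

$$\frac{\partial}{\partial\theta_l} \log P_\theta(y_1^n)\Big|_{\theta_0} = \sum_{k=1}^{n} u_k$$

where

$$u_k := \frac{\partial}{\partial\theta_l} \log P_\theta(y_k \mid y_1^{k-1})\Big|_{\theta_0}.$$

Computing the conditional expectation

$$E_{\theta_0}(u_k \mid y_1^{k-1}) = \sum_{y_k} \frac{\partial}{\partial\theta_l} P_\theta(y_k \mid y_1^{k-1})\Big|_{\theta_0} = \frac{\partial}{\partial\theta_l} \sum_{y_k} P_\theta(y_k \mid y_1^{k-1})\Big|_{\theta_0} = 0.$$

We see that $\{\frac{\partial}{\partial\theta_l} \log P_\theta(y_1^n)|_{\theta_0}, \sigma(y_1^n)\}$ is a martingale.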

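Since $Y_t$ takes values in a finite set it follows trivially that the martingale
is square integrable. Lemma 4.2.2 guarantees that $|u_k| \le C$ a.s. $P_{\theta_0}$ for some
constant $C$ and therefore we can apply the LIL for martingales as given in Theorem
3.1.1. In particular

$$\limsup_{n\to\infty} \frac{\left| \frac{\partial}{\partial\theta_l} \log P_\theta(y_1^n)|_{\theta_0} \right|}{\sqrt{2 A_n \log\log A_n}} = 1 \quad \text{a.s. on } [A_n \to \infty].$$

The definition of $A_n$ gives (see comment 3.1.2b)

$$A_n = \sum_{k=1}^{n} E_{\theta_0}(u_k^2 \mid y_1^{k-1}).$$

Observe that:

$$E_{\theta_0}(u_k^2 \mid y_1^{k-1}) = -E_{\theta_0}\left( \frac{\partial^2}{\partial\theta_l^2} \log P_\theta(y_k \mid y_1^{k-1})\Big|_{\theta_0} \,\Big|\, y_1^{k-1} \right)$$

can be proved by direct computation.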
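The two facts used in Step 1, that the score has zero conditional mean and that its second moment equals minus the expected second derivative, can be checked by direct computation in the simplest possible setting. The sketch below is only an illustration: it assumes a single Bernoulli observation with parameter `theta` in place of the conditional law $P_\theta(y_k \mid y_1^{k-1})$, and all function names are hypothetical.

```python
def score(y, theta):
    """d/dtheta log p_theta(y) for a Bernoulli(theta) observation y in {0, 1}."""
    return y / theta - (1 - y) / (1 - theta)


def second_derivative(y, theta):
    """d^2/dtheta^2 log p_theta(y) for the same Bernoulli model."""
    return -y / theta**2 - (1 - y) / (1 - theta) ** 2


def expectation(f, theta):
    """Exact expectation of f(Y, theta) when Y ~ Bernoulli(theta)."""
    return (1 - theta) * f(0, theta) + theta * f(1, theta)


theta0 = 0.3  # illustrative "true" parameter
mean_score = expectation(score, theta0)
second_moment = expectation(lambda y, t: score(y, t) ** 2, theta0)
minus_mean_hessian = -expectation(second_derivative, theta0)

print(mean_score, second_moment, minus_mean_hessian)
```

In this toy model both `second_moment` and `minus_mean_hessian` equal $1/(\theta_0(1-\theta_0))$, which plays the role of $\beta^2$.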

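To complete the first part of the proof we show that:

$$\lim_{n\to\infty} \frac{A_n}{n} = \beta^2$$

where $\beta^2 := -\frac{\partial^2}{\partial\theta_l^2} H_{\theta_0}(\theta)|_{\theta_0} > 0$ by hypothesis PH.

This is a consequence of the ergodic theorem and of the following bound ($C$
is a finite constant and $0 < \rho < 1$):

$$\left| E_{\theta_0}\left( -\frac{\partial^2}{\partial\theta_l^2} \log P_\theta(y_k \mid y_{-\infty}^{k-1})\Big|_{\theta_0} \,\Big|\, y_{-\infty}^{k-1} \right) - E_{\theta_0}(u_k^2 \mid y_1^{k-1}) \right| \le C \rho^{k}$$

The bound is proved as follows:

$$\left| E_{\theta_0}\left( -\frac{\partial^2}{\partial\theta_l^2} \log P_\theta(y_k \mid y_{-\infty}^{k-1})\Big|_{\theta_0} \,\Big|\, y_{-\infty}^{k-1} \right) - E_{\theta_0}(u_k^2 \mid y_1^{k-1}) \right|$$

$$= \left| \sum_{y} P_{\theta_0}(y \mid y_{-\infty}^{k-1}) \frac{\partial^2}{\partial\theta_l^2} \log P_\theta(y \mid y_{-\infty}^{k-1})\Big|_{\theta_0} - P_{\theta_0}(y \mid y_1^{k-1}) \frac{\partial^2}{\partial\theta_l^2} \log P_\theta(y \mid y_1^{k-1})\Big|_{\theta_0} \right|$$

$$\le \sum_{y} P_{\theta_0}(y \mid y_{-\infty}^{k-1}) \left| \frac{\partial^2}{\partial\theta_l^2} \log P_\theta(y \mid y_{-\infty}^{k-1})\Big|_{\theta_0} - \frac{\partial^2}{\partial\theta_l^2} \log P_\theta(y \mid y_1^{k-1})\Big|_{\theta_0} \right|$$

$$\quad + \sum_{y} \left| P_{\theta_0}(y \mid y_{-\infty}^{k-1}) - P_{\theta_0}(y \mid y_1^{k-1}) \right| \left| \frac{\partial^2}{\partial\theta_l^2} \log P_\theta(y \mid y_1^{k-1})\Big|_{\theta_0} \right|$$

From Lemma 4.1 of Baum-Petrie [4] we get the desired bound. For $A_n/n$ we therefore
have:

$$\left| \frac{A_n}{n} - \frac{1}{n} \sum_{k=1}^{n} E_{\theta_0}\left( -\frac{\partial^2}{\partial\theta_l^2} \log P_\theta(y_k \mid y_{-\infty}^{k-1})\Big|_{\theta_0} \,\Big|\, y_{-\infty}^{k-1} \right) \right| \le \frac{C}{n} \sum_{k=1}^{n} \rho^{k}$$

Taking the limit for $n \to \infty$ and applying the ergodic theorem we conclude that
$A_n/n \to \beta^2$ a.s. and the result of Theorem 3.1.1 becomes

$$\limsup_{n\to\infty} \frac{\left| \frac{\partial}{\partial\theta_l} \log P_\theta(y_1^n)|_{\theta_0} \right|}{\sqrt{n \log\log n}} = \sqrt{2}\,\beta$$

As discussed in comment 3.1.2d this is equivalent to saying that

$$\frac{\partial}{\partial\theta_l} \log P_\theta(y_1^n)\Big|_{\theta_0} = O_{a.s.}(\sqrt{n \log\log n}).$$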
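Step 2:

Here we determine the order of $\hat\theta_n - \theta_0$. Clearly:

$$\frac{1}{n} \frac{\partial}{\partial\theta} \log P_\theta(y_1^n)\Big|_{\hat\theta_n} = 0 = \frac{1}{n} \frac{\partial}{\partial\theta} \log P_\theta(y_1^n)\Big|_{\theta_0} + \frac{1}{n} \frac{\partial^2}{\partial\theta^2} \log P_\theta(y_1^n)\Big|_{\theta_0} (\hat\theta_n - \theta_0) + o(\| \hat\theta_n - \theta_0 \|^2)$$

It follows that:

$$\frac{1}{n} \frac{\partial^2}{\partial\theta^2} \log P_\theta(y_1^n)\Big|_{\theta_0} (\hat\theta_n - \theta_0) = -\frac{1}{n} \frac{\partial}{\partial\theta} \log P_\theta(y_1^n)\Big|_{\theta_0} + o(\| \hat\theta_n - \theta_0 \|^2)$$

Lemma 4.2.1 plays a crucial role here because it guarantees that $\hat\theta_n - \theta_0 \to 0$ a.s.
and therefore $o(\| \hat\theta_n - \theta_0 \|^2)$ is negligible.

$$-\frac{1}{n} \frac{\partial^2}{\partial\theta^2} \log P_\theta(y_1^n)\Big|_{\theta_0} \to -\frac{\partial^2}{\partial\theta^2} H_{\theta_0}(\theta)\Big|_{\theta_0} > 0$$

by hypothesis and therefore for $n$ large enough the matrix on the LHS is invertible and we have:

$$(\hat\theta_n - \theta_0) = \left[ \frac{1}{n} \frac{\partial^2}{\partial\theta^2} \log P_\theta(y_1^n)\Big|_{\theta_0} \right]^{-1} \left[ -\frac{1}{n} \frac{\partial}{\partial\theta} \log P_\theta(y_1^n)\Big|_{\theta_0} + o(\| \hat\theta_n - \theta_0 \|^2) \right]$$

From Step 1 we conclude that

$$(\hat\theta_n - \theta_0) = O_{a.s.}\left( \sqrt{\frac{\log\log n}{n}} \right)$$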


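Step 3:

To complete the proof expand $\frac{1}{n} \log P_{\hat\theta_n}(y_1^n)$ in Taylor's series around $\theta_0$:

$$\frac{1}{n} \log P_{\hat\theta_n}(y_1^n) = \frac{1}{n} \log P_{\theta_0}(y_1^n) + \frac{1}{n} \frac{\partial}{\partial\theta} \log P_\theta(y_1^n)\Big|_{\theta_0} (\hat\theta_n - \theta_0) + \frac{1}{2} (\hat\theta_n - \theta_0)^T \frac{1}{n} \frac{\partial^2}{\partial\theta^2} \log P_\theta(y_1^n)\Big|_{\tilde\theta_n} (\hat\theta_n - \theta_0)$$

where $\tilde\theta_n$ is a point in $\Theta^{\delta}_{q_0}$ such that $\| \tilde\theta_n - \theta_0 \| < \| \hat\theta_n - \theta_0 \|$ and therefore
$\tilde\theta_n \to \theta_0$. The matrix $\frac{1}{n} \frac{\partial^2}{\partial\theta^2} \log P_\theta(y_1^n)|_{\tilde\theta_n}$ is bounded (it actually converges) and
from the results of Steps 1 and 2 we conclude that

$$\frac{1}{n} \log P_{\hat\theta_n}(y_1^n) = \frac{1}{n} \log P_{\theta_0}(y_1^n) + O_{a.s.}\left( \frac{\log\log n}{n} \right).$$

□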


4.3  Finite  memory  approximation 


We present here a result on the rate of growth of the MLR obtained using an
approximation technique. The idea is to approximate the HMC process with a
sequence of Markov chains of increasing order $m$. The result is too weak for our
purpose of estimating the order of the HMC, but we believe the technique of proof
to be interesting in itself. In this section we will not need assumption SP'. The
only hypothesis on $Y_t$ will be the following.

$Y_t$ is an HMC of order $q$ admitting a representation $\theta_0$ with $\theta_0 \in \Theta^{\delta}_q$.

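The standard log-likelihood ratio will be denoted $R_n(\theta)$, i.e.:

$$R_n(\theta) := \log \frac{P_\theta(y_1^n)}{P_{\theta_0}(y_1^n)}$$

and for $1 \le m < n$ the $m$-th order approximation $R_n^m(\theta)$ is defined as:

$$R_n^m(\theta) := \log \frac{P_\theta(y_1^m)}{P_{\theta_0}(y_1^m)} + \sum_{k=m+1}^{n} \log \frac{P_\theta(y_k \mid y_{k-m}^{k-1})}{P_{\theta_0}(y_k \mid y_{k-m}^{k-1})}$$

i.e. $R_n^m(\theta)$ is obtained keeping track of only the $m$ most recent samples. Our first
result gives a bound on how well $R_n^m(\theta)$ approximates $R_n(\theta)$.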


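Lemma 4.3.1

$$| R_n(\theta) - R_n^m(\theta) | \le \frac{2}{\delta (1-\rho)} \, n \rho^m$$

for some $0 < \rho < 1$.

Proof: From the definitions we get:

$$| R_n(\theta) - R_n^m(\theta) | \le \sum_{k=m+1}^{n} | \log P_\theta(y_k \mid y_1^{k-1}) - \log P_\theta(y_k \mid y_{k-m}^{k-1}) | + \sum_{k=m+1}^{n} | \log P_{\theta_0}(y_k \mid y_1^{k-1}) - \log P_{\theta_0}(y_k \mid y_{k-m}^{k-1}) |$$

For each term of the first sum we have:

$$| \log P_\theta(y_k \mid y_1^{k-1}) - \log P_\theta(y_k \mid y_{k-m}^{k-1}) | \le \sum_{h=1}^{k-m-1} | \log P_\theta(y_k \mid y_h^{k-1}) - \log P_\theta(y_k \mid y_{h+1}^{k-1}) |$$

Using inequality A.0.3 we have:

$$| \log P_\theta(y_k \mid y_h^{k-1}) - \log P_\theta(y_k \mid y_{h+1}^{k-1}) | \le \frac{1}{\delta} \rho^{k-h-1}$$

where $0 < \rho < 1$. Observe that the bound does not depend on $\theta$ and is therefore
valid also for the terms of the second sum. Adding all the terms we get:

$$| R_n(\theta) - R_n^m(\theta) | \le \frac{2}{\delta} \sum_{k=m+1}^{n} \sum_{h=1}^{k-m-1} \rho^{k-h-1}$$

Some minor algebra gives:

$$\sum_{k=m+1}^{n} \sum_{h=1}^{k-m-1} \rho^{k-h-1} = \frac{\rho^m}{1-\rho} \sum_{k=m+1}^{n} (1 - \rho^{k-m-1}) = \frac{\rho^m}{1-\rho} \left[ (n-m) - \frac{1 - \rho^{n-m}}{1-\rho} \right] \le \frac{n \rho^m}{1-\rho}$$

□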
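The "minor algebra" in the proof of Lemma 4.3.1 can be checked numerically. The sketch below evaluates the double sum by brute force and compares it with the closed form and with the cruder bound $n\rho^m/(1-\rho)$; the function names and the test values of `n`, `m` and `rho` are illustrative assumptions.

```python
def double_sum(n, m, rho):
    """Sum over k = m+1..n and h = 1..k-m-1 of rho**(k-h-1)."""
    return sum(
        rho ** (k - h - 1)
        for k in range(m + 1, n + 1)
        for h in range(1, k - m)
    )


def closed_form(n, m, rho):
    """(rho**m / (1-rho)) * ((n-m) - (1 - rho**(n-m)) / (1-rho))."""
    geometric = (1 - rho ** (n - m)) / (1 - rho)
    return rho**m / (1 - rho) * ((n - m) - geometric)


def upper_bound(n, m, rho):
    """The bound n * rho**m / (1 - rho) that appears in the lemma."""
    return n * rho**m / (1 - rho)


n, m, rho = 50, 5, 0.6  # illustrative values
print(double_sum(n, m, rho), closed_form(n, m, rho), upper_bound(n, m, rho))
```

The brute-force sum and the closed form agree up to rounding, and both stay below the bound, which is what the lemma needs.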


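It is possible to interpret $R_n^m(\theta)$ as the log-likelihood ratio of two $m$-th order
Markov chains. Define:

$$P_\theta^m(y_1^n) := P_\theta(y_1^m) \prod_{k=m+1}^{n} P_\theta(y_k \mid y_{k-m}^{k-1})$$

and analogously for $P_{\theta_0}^m$.

The process $Y_t$ becomes an $m$-th order Markov chain under $P_\theta^m$ and it is easily
seen that $R_n^m(\theta) = \log \frac{P_\theta^m(y_1^n)}{P_{\theta_0}^m(y_1^n)}$. Both $P_\theta^m$ and $P_{\theta_0}^m$ are elements of

$$\mathcal{P}_m := \{ \text{all transition probabilities } P(y_0 \mid y_{-m}^{-1}) \text{ with elements} > \epsilon \}.$$

(This is the set that in Section 3.4 was called $\Theta_m$; we introduce a new name for it to
avoid confusion with the parameter set of the HMC). Let $P$ be the generic element
of $\mathcal{P}_m$ and define $R_n^m(P) := \log \frac{P(y_1^n)}{P_{\theta_0}^m(y_1^n)}$. Clearly $\{ P_\theta(y_0 \mid y_{-m}^{-1}) \; ; \; \theta \in \Theta^{\delta}_q \} \subset \mathcal{P}_m$ and
therefore

$$\sup_{\theta \in \Theta^{\delta}_q} R_n^m(\theta) \le \sup_{P \in \mathcal{P}_m} R_n^m(P) := R_n^m(\hat P)$$

where $\hat P$ denotes the maximum likelihood estimator of the transition probabilities
$P(y_0 \mid y_{-m}^{-1})$. As usual $r$ denotes the cardinality of the set of values of $Y_t$.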

Lemma 4.3.2

$$R_n(\hat\theta_n) = r^m \, O_{a.s.}(\log\log n) + \frac{2}{\delta (1-\rho)} \, n \rho^m$$

Proof: Follows from Lemma 4.3.1 and Theorem 3.3.2.

□

Lemma 4.3.2 can be used to estimate the rate of growth of $R_n(\hat\theta_n)$ or the rate of
convergence to zero of $\frac{1}{n} R_n(\hat\theta_n)$. Ideally we would like to prove that $\frac{1}{n} R_n(\hat\theta_n) =
O_{a.s.}(\frac{\log\log n}{n})$ but the bound in 4.3.2 is too weak for this; nevertheless we have:


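Lemma 4.3.3

$$\frac{1}{n} R_n(\hat\theta_n) = O_{a.s.}\left( \frac{\log\log n}{n^{\alpha}} \right)$$

for all

$$\alpha \le \alpha_{max} := \frac{\log_r \rho}{\log_r \rho - 1}, \qquad \rho = \frac{(q-1) - \delta^2}{(q-1) + \delta^2}.$$

Proof: From Lemma 4.3.2 we have for all $n$ large and some finite $C$:

$$\frac{n^{\alpha}}{\log\log n} \cdot \frac{1}{n} R_n(\hat\theta_n) \le C \left( r^m n^{\alpha - 1} + \rho^m \frac{n^{\alpha}}{\log\log n} \right)$$

The idea is to choose $m$ as a function of $n$ in such a way that the right hand side
remains bounded for $n \to \infty$.

Clearly $r^m n^{\alpha - 1} = 1$ $(\forall n)$ if we choose $m = (1 - \alpha) \log_r n$. (Since $m$ must
be positive for all $n$ we get $\alpha < 1$). Adopting this value for $m$ the second term
becomes:

$$\frac{n^{\alpha} \rho^{(1-\alpha) \log_r n}}{\log\log n} = \frac{n^{\alpha} \, n^{(1-\alpha) \log_r \rho}}{\log\log n} = \frac{n^{\alpha + (1-\alpha) \log_r \rho}}{\log\log n}$$

This term remains bounded in $n$ (as $n \to \infty$) if $\alpha + (1 - \alpha) \log_r \rho \le 0$,
i.e. $\alpha \le \frac{\log_r \rho}{\log_r \rho - 1} < 1$. Using the expression for $\rho$ obtained in A.0.2 we complete the
proof.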
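To get a feeling for how small the admissible exponent in Lemma 4.3.3 is, the threshold can be evaluated numerically. The sketch below assumes the expressions $\mu_\delta = (1 + (q-1)/\delta^2)^{-1}$ and $\rho = 1 - 2\mu_\delta$ from the Appendix; the function names and the sample values of `q`, `delta`, `r` and `n` are illustrative.

```python
import math


def rho_from(q, delta):
    """rho = 1 - 2*mu_delta with mu_delta = 1 / (1 + (q - 1) / delta**2)."""
    mu = 1.0 / (1.0 + (q - 1) / delta**2)
    return 1.0 - 2.0 * mu


def alpha_max(q, delta, r):
    """Largest alpha satisfying alpha + (1 - alpha) * log_r(rho) <= 0."""
    log_r_rho = math.log(rho_from(q, delta), r)
    return log_r_rho / (log_r_rho - 1.0)


def memory_length(n, alpha, r):
    """The choice m = (1 - alpha) * log_r(n), for which r**m * n**(alpha - 1) = 1."""
    return (1.0 - alpha) * math.log(n, r)


a = alpha_max(q=2, delta=0.2, r=2)
m = memory_length(n=1000, alpha=a, r=2)
print(a, m)
```

For these values the exponent comes out close to 0.1, far from the value 1 that would be needed, which illustrates why the result is too weak for order estimation.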
4.4 Information theoretic approach

In this section we use a result from Information Theory to get a useful bound
on the MLR valid for all values of $q$. Recall that by $P_{\hat\theta_q(n)}(y_1^n)$ we denote the
maximized probability $P_\theta(y_1^n)$ for $P_\theta$ a HMC with $\theta \in \Theta^{\delta}_q$. We denote by $P_{ML_q}(y_1^n)$
the corresponding maximized probability when $\theta \in \Theta_q$. The next lemma is crucial.
A complete proof is to be found in Csiszar [7]. In this section log denotes $\log_2$.


Lemma 4.4.1 There exists a probability measure $Q$ on $\mathcal{Y}^\infty$ such that

$$\log \frac{P_{ML_q}(y_1^n)}{Q(y_1^n)} \le \frac{d(q)}{2} \log n - c \quad \text{for all } n \text{ and } y_1^n$$

where $c$ is a constant and $d(q) := q(q + r - 2)$.

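Sketch of the proof:

First we observe that:

$$P_{ML_q}(y_1^n) = \max_\theta \sum_{x_1^n} P_\theta(y_1^n \mid x_1^n) P_\theta(x_1^n) \le \sum_{x_1^n} \max_\theta P_\theta(y_1^n \mid x_1^n) \cdot \max_\theta P_\theta(x_1^n) \qquad (4.1)$$

The proof proceeds by showing the existence of probability measures $Q_1$ and $Q_2$
such that:

$$\max_\theta P_\theta(y_1^n \mid x_1^n) \le Q_1(y_1^n \mid x_1^n) \, n^{q(r-1)/2} \qquad (4.2)$$

$$\max_\theta P_\theta(x_1^n) \le Q_2(x_1^n) \, n^{q(q-1)/2} \qquad (4.3)$$

Clearly $Q(y_1^n) := \sum_{x_1^n} Q_1(y_1^n \mid x_1^n) Q_2(x_1^n)$ is a probability measure and
substituting into (4.1) completes the proof. The existence of $Q_1$ and $Q_2$ is proved
directly by actually constructing measures $Q_1$ and $Q_2$ that satisfy (4.2) and (4.3)
respectively.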


The following Theorem, based on Lemma 4.4.1, will be essential to finding estimators
of the order that avoid overestimation.

Theorem 4.4.1

$$\limsup_{n\to\infty} \, (\log n)^{-1} \log \frac{P_{\hat\theta_q(n)}(y_1^n)}{P_{\theta_0}(y_1^n)} \le \frac{d(q)}{2} + 2 \quad \text{a.s. } P_{\theta_0}$$


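Proof: Introducing the measure $Q$ from Lemma 4.4.1 we have:

$$\log \frac{P_{\hat\theta_q(n)}(y_1^n)}{P_{\theta_0}(y_1^n)} = \log \frac{P_{\hat\theta_q(n)}(y_1^n)}{Q(y_1^n)} + \log \frac{Q(y_1^n)}{P_{\theta_0}(y_1^n)}$$

We multiply by $(\log n)^{-1}$ and apply the inequality $\limsup (a_n + b_n) \le \limsup a_n + \limsup b_n$.
The first term is evaluated using Lemma 4.4.1:

$$\limsup_{n\to\infty} \, (\log n)^{-1} \log \frac{P_{\hat\theta_q(n)}(y_1^n)}{Q(y_1^n)} \le \frac{d(q)}{2}$$

To evaluate the second term define:

$$A_n := \left\{ y_1^n \; ; \; (\log n)^{-1} \log \frac{Q(y_1^n)}{P_{\theta_0}(y_1^n)} > 2 \right\}$$

Clearly:

$$A_n = \{ y_1^n \; ; \; Q(y_1^n) > n^2 P_{\theta_0}(y_1^n) \}$$

It follows that:

$$P_{\theta_0}(A_n) = \sum_{y_1^n \in A_n} P_{\theta_0}(y_1^n) \le \sum_{y_1^n \in A_n} \frac{Q(y_1^n)}{n^2} \le \frac{1}{n^2}$$

Thus $\sum_n P_{\theta_0}(A_n) < \infty$ and from the easy direction of the Borel-Cantelli lemma we
conclude that $P_{\theta_0}(A_n \text{ i.o.}) = 0$. This is equivalent to $\limsup \, (\log n)^{-1} \log \frac{Q(y_1^n)}{P_{\theta_0}(y_1^n)} \le 2$.

□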
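The key estimate in the proof of Theorem 4.4.1 is that, for any two probability measures $P$ and $Q$ on a finite set and any threshold $t > 0$, the $P$-probability of $\{y : Q(y) > t\,P(y)\}$ is at most $1/t$ (in the proof $t = n^2$). The sketch below checks this on two arbitrary probability vectors; the vectors, thresholds and function name are illustrative assumptions.

```python
def exceedance_probability(p, q, threshold):
    """P-probability of the set {y : q[y] > threshold * p[y]}."""
    return sum(py for py, qy in zip(p, q) if qy > threshold * py)


# Two arbitrary probability vectors on a five-point set (illustrative values).
p = [0.50, 0.30, 0.15, 0.04, 0.01]
q = [0.05, 0.05, 0.10, 0.30, 0.50]

# Pairs (threshold, P-probability of the exceedance set).
checks = [(t, exceedance_probability(p, q, t)) for t in (1.5, 4.0, 9.0, 25.0)]
for t, prob in checks:
    print(t, prob, prob <= 1.0 / t)
```

Since the bound $1/n^2$ is summable in $n$, the Borel-Cantelli lemma applies regardless of the particular measures involved.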


4.5  Compensators  avoiding  overestimation 

We are finally able to give a set of sufficient conditions on the compensators of the
maximized likelihood (the sequences $\delta_n(q)$) to avoid overestimation of the order.
Theorem 4.5.1 is complementary to Theorem 4.1.1; together they allow us to construct
compensators $\delta_n(q)$ that guarantee strong consistency of the order estimator
$\hat q(n)$. Theorem 4.5.1 is the analog of Theorem 3.4.3 and should be compared with
it.
Theorem 4.5.1 Compensators avoiding overestimation

Let $Y_t$ be a process satisfying assumptions SP' and PH. If the compensator is of
the form:

$$\delta_n(q) := \varphi(n) h(q)$$

where the function $\varphi$ satisfies:

$$\liminf_{n\to\infty} \left( \frac{\log n}{n} \right)^{-1} \varphi(n) > 1$$

and the function $h$ satisfies:

$$h(q') - h(q) \ge \frac{d(q')}{2} + 2 \quad \forall\, q' > q \ge 1$$

Then:

$$\limsup_{n\to\infty} \hat q(n) \le q_0 \quad \text{a.s. } P_{\theta_0}$$


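Proof: Let $q > q_0$. From the definitions we have:

$$C(q_0, n) - C(q, n) = \frac{1}{n} \log \frac{P_{\hat\theta_q(n)}(y_1^n)}{P_{\hat\theta_{q_0}(n)}(y_1^n)} + \varphi(n) (h(q_0) - h(q))$$

Therefore:

$$\left( \frac{\log n}{n} \right)^{-1} [C(q_0, n) - C(q, n)] = (\log n)^{-1} \log \frac{P_{\hat\theta_q(n)}(y_1^n)}{P_{\hat\theta_{q_0}(n)}(y_1^n)} + \left( \frac{\log n}{n} \right)^{-1} \varphi(n) (h(q_0) - h(q))$$

The first term on the RHS can be bounded using Theorem 4.4.1 as:

$$\limsup_{n\to\infty} \, (\log n)^{-1} \left( \log \frac{P_{\hat\theta_q(n)}(y_1^n)}{P_{\theta_0}(y_1^n)} - \log \frac{P_{\hat\theta_{q_0}(n)}(y_1^n)}{P_{\theta_0}(y_1^n)} \right) \le \frac{d(q)}{2} + 2$$

This follows from the fact that $\log \frac{P_{\hat\theta_{q_0}(n)}(y_1^n)}{P_{\theta_0}(y_1^n)} = O_{a.s.}(\log\log n)$ as a consequence of
Theorem 4.2.1 and assumption PH. On the second term on the RHS we use the
hypothesis on the $h$ function ($q > q_0$) to get:

$$\limsup_{n\to\infty} \left( \frac{\log n}{n} \right)^{-1} [C(q_0, n) - C(q, n)] \le \left( \frac{d(q)}{2} + 2 \right) \left[ 1 - \liminf_{n\to\infty} \left( \frac{\log n}{n} \right)^{-1} \varphi(n) \right]$$

The hypothesis on $\varphi(\cdot)$ now gives:

$$\limsup_{n\to\infty} \left( \frac{\log n}{n} \right)^{-1} [C(q_0, n) - C(q, n)] < 0$$

On the other hand $[C(q_0, n) - C(\hat q(n), n)] \ge 0$ by definition of $\hat q(n)$. We conclude
that $\limsup \hat q(n) \le q_0$.

□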
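The role of the criterion $C(q, n)$ in the proof can be made concrete with a small sketch. It assumes, as the difference $C(q_0, n) - C(q, n)$ above suggests, that $C(q, n) = -\frac{1}{n} \log P_{\hat\theta_q(n)}(y_1^n) + \varphi(n) h(q)$ and that $\hat q(n)$ is a minimizer of $C(\cdot, n)$ over the candidate orders. The function name and the maximized log-likelihood values are hypothetical; computing them requires a separate likelihood maximization that is not shown.

```python
import math


def estimate_order(max_loglik, n, phi, h):
    """Return the order q minimizing C(q, n) = -(1/n) * max_loglik[q] + phi(n) * h(q).

    max_loglik maps each candidate order q to the maximized log-likelihood
    log P_{theta_hat_q(n)}(y_1^n), assumed to have been computed elsewhere.
    Ties are resolved in favour of the smallest order.
    """
    def criterion(q):
        return -max_loglik[q] / n + phi(n) * h(q)

    return min(sorted(max_loglik), key=criterion)


# Illustrative use with r = 2 output symbols, so d(q) = q * (q + r - 2) = q**2,
# phi(n) = 2 * log(n) / n and h(q) = d(q)**2.
r = 2
n = 1000
max_loglik = {1: -690.0, 2: -400.0, 3: -398.0}  # hypothetical values
q_hat = estimate_order(
    max_loglik,
    n,
    phi=lambda n: 2.0 * math.log(n) / n,
    h=lambda q: (q * (q + r - 2)) ** 2,
)
print(q_hat)
```

In this made-up example the gain in likelihood from order 2 to order 3 is too small to offset the larger compensator, so the sketch selects order 2.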

The existence of a strongly consistent estimator $\hat q(n)$ of the order $q_0$ will be
established by giving examples of functions $h(\cdot)$ and $\varphi(\cdot)$ satisfying both the
conditions imposed by Theorem 4.1.1 and Theorem 4.5.1.

Theorem 4.5.2 The compensator:

$$\delta_n(q) := 2 \, d^2(q) \, \frac{\log n}{n}$$

produces a strongly consistent estimator $\hat q(n)$ of $q_0$.

Proof: Clearly $\lim \delta_n(q) = 0$ $\forall q$ thus satisfying the conditions of Theorem 4.1.1.
The function $\varphi(n) := 2 \frac{\log n}{n}$ is such that $\lim \left( \frac{\log n}{n} \right)^{-1} \varphi(n) = 2 > 1$ and therefore
satisfies the condition imposed by Theorem 4.5.1. For the function $h(q) := d^2(q)$
we must check the condition:

$$d^2(q') - d^2(q) \ge \frac{d(q')}{2} + 2$$

Recall that $d(q) := q(q + r - 2)$. The condition to be verified is equivalent to:

$$q'(q' + r - 2) \left[ q'(q' + r - 2) - \tfrac{1}{2} \right] \ge q^2 (q + r - 2)^2 + 2$$

for all $q' > q \ge 1$. This is easily established observing that the LHS is increasing
in $q'$ and that for $q' = q + 1$ the inequality is verified.

□
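The inequality checked at the end of the proof can also be confirmed numerically over a range of orders and alphabet sizes. The sketch below assumes $r \ge 2$ and uses $h(q) = d^2(q)$ with $d(q) = q(q + r - 2)$; the ranges scanned and the function names are arbitrary illustrative choices.

```python
def d(q, r):
    """d(q) = q * (q + r - 2)."""
    return q * (q + r - 2)


def condition_holds(q, q_prime, r):
    """Check d(q')**2 - d(q)**2 >= d(q') / 2 + 2, i.e. the h-condition for h = d**2."""
    return d(q_prime, r) ** 2 - d(q, r) ** 2 >= d(q_prime, r) / 2 + 2


all_ok = all(
    condition_holds(q, q_prime, r)
    for r in range(2, 8)
    for q in range(1, 30)
    for q_prime in range(q + 1, 31)
)
print(all_ok)
```

The tightest case in the scanned range is $r = 2$, $q = 1$, $q' = 2$, where the left side is $15$ and the right side is $4$.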
Appendix A


We collect here some basic inequalities for HMC's found in Baum-Petrie [4] and
often used in the text. The proofs of these results imitate closely analogous results
for Markov chains given by Doob [9].

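Lemma A.0.1

Let $Y_t$ be a HMC with p.d.f. $P_\theta$ where $\theta \in \Theta^{\delta}_q$ then

$$P_\theta(X_{t+1} = j \mid X_t = i, Y_{t_k}, t_k \in T) \ge \mu_\delta$$

where $\mu_\delta = (1 + \frac{q-1}{\delta^2})^{-1}$ is independent of $\theta, T, Y_{t_k}, i, j$.

Proof: Let $j$ and $j'$ be elements of $X$ and suppose that $t_k \ne t + 1$ for all $k$. Then

$$\frac{P_\theta(X_{t+1} = j \mid X_t = i, Y_{t_k})}{P_\theta(X_{t+1} = j' \mid X_t = i, Y_{t_k})} = \frac{P_\theta(X_{t+1} = j, X_t = i, Y_{t_k})}{P_\theta(X_{t+1} = j', X_t = i, Y_{t_k})}$$

$$= \frac{P_\theta(X_{t+1} = j, Y_{t_k}\, t_k > t + 1 \mid X_t = i)}{P_\theta(X_{t+1} = j', Y_{t_k}\, t_k > t + 1 \mid X_t = i)}$$

$$= \frac{\sum_{j_0} P_\theta(X_{t+1} = j, X_{t+2} = j_0, Y_{t_k}\, t_k > t + 1 \mid X_t = i)}{\sum_{j_0} P_\theta(X_{t+1} = j', X_{t+2} = j_0, Y_{t_k}\, t_k > t + 1 \mid X_t = i)}$$

$$= \frac{\sum_{j_0} P(Y_{t_k}\, t_k \ge t + 2 \mid X_{t+2} = j_0)\, a_{jj_0} a_{ij}}{\sum_{j_0} P(Y_{t_k}\, t_k \ge t + 2 \mid X_{t+2} = j_0)\, a_{j'j_0} a_{ij'}} \qquad (*)$$

Let $\alpha_{j_0} := P(Y_{t_k}\, t_k \ge t + 2 \mid X_{t+2} = j_0)$. The last expression is:

$$\frac{a_{ij}}{a_{ij'}} \cdot \frac{\sum_{j_0} \alpha_{j_0} a_{jj_0}}{\sum_{j_0} \alpha_{j_0} a_{j'j_0}}$$

Since $a_{ij} \ge \delta$ $\forall\, i, j$ we have

$$\frac{\sum_{j_0} \alpha_{j_0} a_{jj_0}}{\sum_{j_0} \alpha_{j_0} a_{j'j_0}} \le \max_{j_0} \frac{a_{jj_0}}{a_{j'j_0}}$$

Therefore:

$$(*) \le \frac{a_{ij}}{a_{ij'}} \max_{j_0} \frac{a_{jj_0}}{a_{j'j_0}} \le \max_{i, j, j', j_0} \left( \frac{a_{ij}\, a_{jj_0}}{a_{ij'}\, a_{j'j_0}} \right) \le \frac{1}{\delta^2}$$

since all elements are $\ge \delta$.

Let now $p_j := P_\theta(X_{t+1} = j \mid X_t = i, Y_{t_k}, t_k \in T)$; then $1 = p_j + \sum_{j' \ne j} p_{j'} \le
p_j + (q - 1) \frac{p_j}{\delta^2}$, i.e. $p_j \ge (1 + \frac{q-1}{\delta^2})^{-1}$.

If $t_k = t + 1$ for some $k$ the proof needs a minor modification.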
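The lower bound of Lemma A.0.1 can be illustrated on a small example. The sketch below assumes a hypothetical 3-state chain with transition matrix `A` (all entries at least `delta`) and emission matrix `B`, conditions on one future observation $Y_{t+2} = y$, and compares the smallest conditional transition probability with $\mu_\delta = (1 + (q-1)/\delta^2)^{-1}$. The matrices and names are assumptions made for the illustration.

```python
def mu_delta(q, delta):
    """mu_delta = 1 / (1 + (q - 1) / delta**2)."""
    return 1.0 / (1.0 + (q - 1) / delta**2)


def conditional_next_state(A, B, i, y):
    """P(X_{t+1} = j | X_t = i, Y_{t+2} = y) for every j, by Bayes' rule."""
    q = len(A)
    weights = [
        A[i][j] * sum(A[j][j0] * B[j0][y] for j0 in range(q))
        for j in range(q)
    ]
    total = sum(weights)
    return [w / total for w in weights]


# Illustrative 3-state chain with all transition probabilities >= delta = 0.1.
A = [[0.8, 0.1, 0.1], [0.1, 0.8, 0.1], [0.2, 0.1, 0.7]]
B = [[0.9, 0.1], [0.5, 0.5], [0.1, 0.9]]
delta = 0.1

smallest = min(
    p
    for i in range(3)
    for y in range(2)
    for p in conditional_next_state(A, B, i, y)
)
print(smallest, mu_delta(3, delta))
```

The bound is far from tight here, which is expected since $\mu_\delta$ is a worst-case constant over all $\theta \in \Theta^{\delta}_q$ and all conditioning sets.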


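To introduce the next Lemma we need to introduce some notation. $C_t$ denotes
a cylinder set in $\mathcal{X}^\infty$, i.e.

$$C_t := \{ X_{t_1} = i_1, X_{t_2} = i_2, \cdots, X_{t_n} = i_n \text{ where } t_k \ge t, \; k = 1, 2, \cdots, n \}$$

$D$ denotes a cylinder set in $\mathcal{Y}^\infty$, i.e.

$$D := \{ Y_{t_1} = y_1, Y_{t_2} = y_2, \cdots, Y_{t_m} = y_m \text{ where } t_k \text{ are arbitrary} \}$$

$$M_\theta^+(d, C_t, D) := \max_i P_\theta(C_t \mid X_{t-d} = i, D)$$

$$M_\theta^-(d, C_t, D) := \min_i P_\theta(C_t \mid X_{t-d} = i, D)$$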

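Lemma A.0.2

$$M_\theta^+(d, C_t, D) - M_\theta^-(d, C_t, D) \le \rho^{d-1}$$

where $\rho = 1 - 2\mu_\delta$.

Proof: (Follows closely the proof of an analogous result for Markov chains given
in Lamperti [15]). We simplify notation to $M_d^{\pm}$ and write $\mu$ for $\mu_\delta$; moreover let
$\gamma_k := P_\theta(C_t \mid X_{t-d} = k, D)$, $\lambda_k := P_\theta(X_{t-d} = k \mid X_{t-d-1} = k_0, D)$ and define $k_0$, $k_0'$ by
$M_{d+1}^+ = P_\theta(C_t \mid X_{t-d-1} = k_0, D)$ and $M_d^- = \gamma_{k_0'}$.

With the new notation we have:

$$M_{d+1}^+ = \sum_k P_\theta(C_t \mid X_{t-d} = k, D)\, P_\theta(X_{t-d} = k \mid X_{t-d-1} = k_0, D) = \sum_k \gamma_k \lambda_k$$

$$= \mu M_d^- + (\lambda_{k_0'} - \mu) \gamma_{k_0'} + \sum_{k \ne k_0'} \lambda_k \gamma_k$$

$$\le \mu M_d^- + (\lambda_{k_0'} - \mu + \sum_{k \ne k_0'} \lambda_k) M_d^+ = \mu M_d^- + (1 - \mu) M_d^+$$

Similarly we get:

$$M_{d+1}^- \ge (1 - \mu) M_d^- + \mu M_d^+$$

Together the last two inequalities give:

$$M_{d+1}^+ - M_{d+1}^- \le (1 - 2\mu)(M_d^+ - M_d^-)$$

The result follows immediately.

□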
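The contraction in Lemma A.0.2 can be observed numerically in the simpler setting of a plain Markov chain, where there is no conditioning set $D$, $C_t = \{X_t = j\}$, and $\mu$ can be taken as the smallest entry of the transition matrix. This simplification, the matrix `A` and the function names below are assumptions made for the illustration.

```python
def mat_mul(P, Q):
    """Product of two square matrices given as lists of rows."""
    n = len(P)
    return [
        [sum(P[i][k] * Q[k][j] for k in range(n)) for j in range(n)]
        for i in range(n)
    ]


def spread(P_d, j):
    """max_i P_d[i][j] - min_i P_d[i][j], the analogue of M_d^+ - M_d^-."""
    column = [row[j] for row in P_d]
    return max(column) - min(column)


A = [[0.8, 0.1, 0.1], [0.1, 0.8, 0.1], [0.2, 0.1, 0.7]]
mu = min(min(row) for row in A)  # every one-step probability is >= mu
rho = 1 - 2 * mu

# spreads[d - 1] holds the largest spread over target states j for the d-step matrix.
spreads = []
P_d = A
for d in range(1, 11):
    spreads.append(max(spread(P_d, j) for j in range(3)))
    P_d = mat_mul(P_d, A)

print(spreads)
```

Each additional step shrinks the spread by at least the factor `rho`, so the dependence on the state `d` steps back decays geometrically, as the lemma states.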


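Lemma A.0.3

$$| P_\theta(C_t \mid Y_n^k) - P_\theta(C_t \mid Y_{n+1}^k) | \le \rho^{t-n}$$

for all $k$ and all $n < t - 1$.

Proof:

$$P_\theta(C_t \mid Y_n^k) = \sum_j P_\theta(C_t \mid Y_n^k, X_{n-1} = j)\, P_\theta(X_{n-1} = j \mid Y_n^k)$$

and therefore:

$$M_{t-n+1}^- \le P_\theta(C_t \mid Y_n^k) \le M_{t-n+1}^+$$

On the other hand

$$P_\theta(C_t \mid Y_{n+1}^k) = \sum_{y_n} P_\theta(C_t \mid Y_n^k)\, P_\theta(Y_n \mid Y_{n+1}^k)$$

is an average of the $P_\theta(C_t \mid Y_n^k)$ with weights $P_\theta(Y_n \mid Y_{n+1}^k)$ and therefore:

$$M_{t-n+1}^- \le P_\theta(C_t \mid Y_{n+1}^k) \le M_{t-n+1}^+$$

The result now follows from Lemma A.0.2.

□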